SECTION I


SOFTWARE  RELIABILITY 

1.1  INTRODUCTION 

1.1.1  The  Need  for  Reliable  Software 

It  should  be  clear  that  reliable  software  is  needed  in  an  application 
such  as  an  automatic  digital  flight  control  system.  The  success  of  the 
mission  and  the  very  lives  of  the  crew  are  obviously  dependent  on  the  ability 
of  the  computer  system  to  operate  free  of  any  catastrophic  error.  Hence, 
before  we  can  entrust  men  and  machines  to  such  a system,  we  need  a means  of 
evaluating  the  reliability  of  the  computer  in  terms  of  both  hardware  and 
software.  Much  work  has  been  done  in  the  area  of  hardware  reliability,  but, 
to  date,  relatively  little  has  been  accomplished  in  the  area  of  software 
reliability.  As  a result,  there  is  a need  to  develop  techniques  which  will 
provide  assurance  of  the  reliability  of  a software  package  or,  better  yet, 
to develop a reliability model which will be suitable to predict the reliability
of the software package at a given point in time. In this report we
examine  the  techniques  and  the  models  which  have  been  proposed  to  date  and 
discuss  their  appropriateness  and  viability. 

The high cost of software today is another reason why reliable software
is important. For example, Boehm estimates that in 1972 the Air Force's
expenditure  on  software  was  between  $1  and  1.5  billion  compared  to  $300-400 
million  on  computer  hardware  [13].  By  the  late  1970's  it  is  expected  that 
software  will  represent  at  least  80  percent  of  the  total  outlay. 

Even  with  large  infusions  of  money,  software  reliability  can  be  a 
real  problem.  Shelley  [110]  relates  the  following  experience  with  the  Apollo 
program: 

"The  world's  most  carefully  planned  and  generously  funded  software 
program  was  that  developed  for  the  Apollo  series  of  lunar  flights.  The 
effort attracted some of the nation's best computer programmers and involved
two competing teams... In the aggregate, about $600 million was
spent on software for the Apollo program. Yet, almost every major fault
of the Apollo program, from false alarms to actual mishaps, was the direct
result of errors in computer software."

1.1.2  The  Software  Reliability  Problem 

Dickson et al. [30] have defined software reliability as "the probability
that a given software program operates for some time period, without
a software  error,  on  the  machine  for  which  it  was  designed  given  that  it  is 
used  within  design  limits".  We  will  loosely  regard  a software  error  as  a 
failure  of  the  program  to  perform  as  expected.  These  software  errors  can  be 
categorized  as  either  errors  in  design  or  implementation  [92].  Design  errors 
include  misinterpretation  of  program  specifications  and  incorrect  problem 
formulation,  while  implementation  errors  include:  typographical  errors,  logic 
errors,  poor  algorithm  approximations,  untested  singularities  or  critical 
values,  and  misinterpretation  of  language  constructions. 

It  is  essential  that  we  realize  that  a piece  of  software  is  unreliable 
if and only if it contains errors. Myers [86] states this point succinctly:
"Although it is reasonable to want to express software reliability in relation
to time, one should recognize that it is not really a function of time. Software
reliability is a function of the number of errors, the severities and
location of those errors and the way in which the system is being used."

Software  reliability,  then,  differs  from  hardware  reliability  primarily 
in  that  software  does  not  degrade  with  time  as  a result  of  environmental 
stress  or  fatigue  (age).  Because  software  does  not  degrade  with  time  and 
since  error  correction  eliminates  the  possibility  of  encountering  an  error 
again,  software  exhibits  reliability  growth.  That  is,  once  an  error  is 
detected and successfully corrected, the program is more reliable than it
was before. Hence, the error or failure rate of software declines as errors
are corrected. Coutinho [22] emphasizes this important point: "It cannot
be  taken  for  granted  that  a deficiency,  once  discovered,  will  be  fixed,  or 
that  the  corrective  action  is  efficient.  Unless  a deficiency  is  identified, 
isolated,  and  effectively  corrected,  there  is  no  reliability  growth".  Pikul 
and  Wojcik  [92]  list  the  following  additional  differences  between  hardware 
and  software: 

(1)  Duplication  of  software  does  not  introduce  new  errors. 

(2)  A comprehensive  failure  modes  and  effects  analysis  is  impractical 
because  of  the  large  number  of  distinct  logic  paths. 

(3)  The  correction  of  a software  fault  alters  the  configuration  of  the 
software and eliminates any possibility of its recurrence.

(4)  There  is  no  standardized  approach  for  exhaustively  testing  software 
in order to assure it meets all operational requirements.

1.1.3  The  Software  Development  Process 

As  a rough  model  of  the  general  behavior  of  the  software  error  rate 
over the software life-cycle, we propose the following four-phase model
(see Figure 1). Phase I is the design and initial testing period in which
equations  and  algorithms  are  designed,  flowcharts  are  prepared,  modules  are 
written,  tested  and  integrated  by  the  manufacturer.  Errors  encountered  in 
this  phase  include  typographical  errors,  syntax  errors  and  the  more  blatant 
logic errors. This phase concludes with the end of initial system integration
testing. Phase II is the "advanced" testing stage where the program is
tested  as  a whole,  being  subjected  to  inputs  which  are  presumed  to  be  typical 
of  the  inputs  to  be  encountered  by  the  user.  This  phase  includes  initial  user 
testing  and  concludes  with  the  software  being  placed  in  operation.  Phase  III 
is  an  operational  phase  in  which  limited  maintenance  is  performed,  correcting 
only  those  errors  which  it  is  cost-effective  to  remove.  Phase  IV  is  an 
operational  stage  where  no  maintenance  is  performed. 

FIG. 1 - PHASES OF SOFTWARE DEVELOPMENT


In  Phase  I the  error  rate  is  expected  to  be  quite  high  and  may  be 
quite  erratic.  As  a result,  data  accumulated  during  this  phase  is  not 
expected  to  be  useful,  if  at  all  meaningful.  As  the  integrated  system  is 
more  extensively  tested  during  Phase  II,  the  occurrence  rate  of  errors  is 
expected  to  become  smoother  and  decline  as  debugging  effort  continues 
(i.e.  as  errors  are  detected  and  eliminated).  The  error  rate  will  continue 
to  decline  during  Phase  III  and  will  in  fact  begin  to  level  off.  It  is 
during  Phase  II  and  Phase  III  that  we  can  obtain  the  most  useful  information 
(data)  for  developing  an  error  model.  It  is  in  these  phases  that  we  can  get 
a picture  of  the  true  error  process.  These  are  the  critical  stages  in  our 
model-building  process  and  good  data  is  required. 

It  is  important  to  note  here  that  we  are  implicitly  assuming  that 
the  number  of  errors  which  is  detected  in  a time  interval  and  the  collection 
of  error  counts  over  a series  of  time  intervals  can  be  modeled  by  a random 
variable  and  stochastic  process.  That  is  to  say,  the  time  between  the 
detection  of  errors,  the  time  between  the  detection  and  correction  of  an 
error,  and  the  number  of  errors  detected  or  corrected  in  a time  interval 
are  all  random  variables  [109].  However,  as  Jelinski  and  Moranda  point 
out  [56],  software  errors  do  not  appear  to  be  random  in  any  active  sense; 
there  is  no  physical  mechanism  for  their  generation.  However,  looking  at 
their  "passive"  roles  instead  of  their  "active"  roles,  the  assumption  of 
randomness  makes  more  sense.  Errors  are  "passive"  in  that  they  can  be 
regarded as a data/software interaction [67]. A program may work perfectly
with  data  from  a subset  of  the  space  of  all  possible  inputs,  but  fail  when 
a data  point  comes  from  a particular  region  of  this  input  space.  Since 
the  data  stream  can  be  regarded  as  a random  process  it  is  natural  to  consider 
failures  of  the  software  as  also  constituting  a random  process. 

In  Phase  IV  errors  tend  to  occur  infrequently;  however,  these  errors 
may be difficult to trace, identify, and eventually correct. Thus, some
errors may never be corrected either because they cannot be identified or
cost-effectively  eliminated.  In  addition,  we  can  reasonably  expect  the 
error  correction  process  to  not  always  be  completely  successful.  Indeed, 
the  repair  action  may  not  successfully  remove  the  error  and  may  even  cause 
new  errors  to  be  generated.  As  a result  of  these  factors,  we  expect  the 
error  rate  in  Phase  IV  to  be  essentially  constant. 

Of  course,  we  do  not  expect  this  error  generation  process  to  be  local 
to  Phase  IV.  Any  time  we  attempt  to  correct  an  error  we  introduce  the 
possibility  of  unsuccessful  correction  or  new  error  generation.  Belady  and 
Lehman  [9]  describe  this  process  with  what  they  call  a "model  of  fault 
penetration"  (see  Figure  2).  They  show  that  each  time  software  undergoes 
attempted  correction,  a fraction  E of  total  errors  is  removed  (extracted) 
and  new  errors  G are  generated.  Thus,  after  each  attempted  correction, 
a new  composition  of  faults  appears  that  consists  of  residual  R and  generated 
G errors. In this way, the authors show that the complexity, or unstructuredness,
of the system increases as repairs are made, thus increasing the
difficulty  of  maintaining  the  system.  Obviously  the  models  and  techniques 
which  include  a mechanism  for  handling  the  possibility  of  the  introduction 
of  new  errors  will  have  a distinct  advantage  over  those  that  do  not. 
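
The extraction and generation cycle described above can be sketched numerically. The following is a minimal illustration, not Belady and Lehman's own formulation: it assumes that every attempted correction removes a fixed fraction E of the faults present and generates a fixed number G of new ones. The function name and the sample values are hypothetical.

```python
def fault_history(initial_faults, extract_fraction, generated_per_repair, cycles):
    """Return the fault count before and after each attempted-correction cycle.

    Assumption (not from the source): E and G stay constant across cycles.
    """
    history = [float(initial_faults)]
    for _ in range(cycles):
        residual = history[-1] * (1.0 - extract_fraction)  # R: faults left behind
        history.append(residual + generated_per_repair)    # R + G: new composition
    return history


# Hypothetical values: 100 initial faults, E = 0.5, G = 5 per cycle.
history = fault_history(100, 0.5, 5, 20)
```

Under these assumptions the fault count does not fall to zero but settles near G/E, which is one way to see why a model that ignores newly generated errors will be optimistic.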

In Sections 1.2 through 1.6 we will present, discuss, and evaluate all
the techniques and models we have encountered which attempt to determine
and/or predict software reliability. The discussion and evaluation will be
carried out in light of the fact that we are concerned with the software associated
with an automatic digital flight control system.


1.2  COMPLETE  TESTING  AND  PROOF  OF  CORRECTNESS 

In  this  section  we  will  consider  the  two  ways  which,  at  first  glance, 
are  the  most  "intuitively  obvious"  methods  of  certifying  that  a piece  of 
software is true and accurate. Then, after more careful investigation, we
will see why neither of them is really useful in our effort.

FIG. 2 - PRIMITIVE MODEL OF "FAULT PENETRATION"


The first method is simply to check all possible logical paths
through the program. This sounds very straightforward and clear. If all
paths through the program are correct, then the entire program will be
correct. The problem lies, however, in the number of paths that would have
to be tested. There may very well be a great many of these paths in a
typical piece of software. Consider the flowchart in Figure 3. This
flowchart represents a rather simple program, and yet the number of possible
logic paths is about 10[exponent illegible], which makes exhaustive testing impossible even
using a computer [15]. Needless to say, more complicated programs (e.g.
digital flight control system software) will only compound an already absurd
situation. Therefore, checking all possible paths is not at all feasible.
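
To give a feel for how quickly path counts grow, the sketch below counts the distinct paths through a single loop whose body contains a fixed number of alternative branches. The structure and the numbers are hypothetical and are not taken from Figure 3; they only illustrate the kind of arithmetic involved.

```python
def count_paths(branches_per_pass, max_passes):
    """Count distinct paths through a loop that may run 1..max_passes times,
    where each pass takes one of `branches_per_pass` alternative routes."""
    return sum(branches_per_pass ** k for k in range(1, max_passes + 1))


# Hypothetical program: 5 alternative routes per pass, up to 20 passes.
total = count_paths(5, 20)
years_at_one_microsecond_each = total / 1e6 / (3600 * 24 * 365)
```

Even this small hypothetical structure yields more than 10^14 paths, so testing one path per microsecond would still take several years.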

This  leads  us  to  the  second  "evident"  method  of  demonstrating  perfect 
software, and that is to perform a complete proof of correctness on the program.
Since this method avoids the drudgery of checking all paths, it
appears  at  first  to  be  more  intuitively  appealing.  Such  a methodology  was 
first  proposed  by  Floyd  and  Naur  and  has  come  to  be  known  as  the  assertion 
method  or  informal  proof  method.  Basically  this  method  associates  each 
input  line  of  a program  with  an  input  assertion  and  each  output  line  with 
an  output  assertion.  The  input  assertion  expresses  any  constraints  on  the 
input  variables,  and  the  output  assertion  expresses  the  desired  relation 
among  output  variables  if  the  program  terminates.  A number  of  intermediate 
lines  are  associated  with  other  assertions  that  express  the  relationship 
among  all  program  variables.  Provided  the  program  terminates,  the  program 
can  be  proved  correct  (partial  correctness)  if  for  each  procedure  the  input 
assertion together with the intervening code implies the output assertion.
A separate proof, utilizing the intermediate assertions, is required for
program  termination.  A program  that  has  partial  correctness  together  with 
proof  of  termination  is  said  to  be  totally  correct. 
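
As an illustration of where the assertions sit, the sketch below annotates a small integer-division routine with an input assertion, an intermediate (loop) assertion, and an output assertion. It is an assumed example, not one drawn from the cited work, and the run-time `assert` statements only check the cases actually executed; a proof would have to show that each assertion follows from the preceding one for all inputs.

```python
def divide(dividend, divisor):
    """Integer division by repeated subtraction, annotated in assertion style."""
    # Input assertion: constraints on the input variables.
    assert dividend >= 0 and divisor > 0

    quotient, remainder = 0, dividend
    while remainder >= divisor:
        # Intermediate assertion: relation among all program variables.
        assert dividend == quotient * divisor + remainder and remainder >= 0
        remainder -= divisor
        quotient += 1

    # Output assertion: desired relation among outputs, given termination.
    assert dividend == quotient * divisor + remainder
    assert 0 <= remainder < divisor
    return quotient, remainder
```

Termination needs its own argument: `remainder` decreases by the positive `divisor` on every pass and is bounded below by zero, so the loop cannot run forever. Partial correctness plus this argument gives total correctness for this routine.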

There have been a number of implementations of and extensions to the
assertion  method  in  recent  years.  One  extension  combined  the  partial 
correctness  proof  and  the  termination  proof  into  one  single  proof,  but  with 
no  associated  reduction  in  complexity.  Another  variation  utilizes  formal 
logic  and  certain  correctness  theorems.  This  variation,  referred  to  as 
formal  proof,  relies  on  algorithms  to  transform  the  proof  of  correctness 
into  the  proof  of  a theorem  in  the  first  order  predicate  calculus  [75]. 

While  a significant  amount  of  work  is  being  done  to  implement,  extend, 
and improve upon the initial effort of Floyd and Naur, the proof of correctness
techniques are not as yet applicable to large scale software packages
such as in an automatic flight control system. The process is long, tedious
and error prone. It is a difficult exercise even for highly skilled programmers
and requires a thorough familiarity with the program being proved.
Just  determining  the  input/output  assertions  can  be  a very  arduous  and 
challenging  task.  Proof  of  correctness  has  only  been  realistically  applied 
to  relatively  short  programs  without  an  inordinate  amount  of  complexity. 
Indeed,  the  longest  program  to  our  knowledge  which  has  been  proved  using 
these  methods  contained  only  about  100  lines.  Research  into  the  automation 
of  the  program  proofs  is  continuing,  but  results  are  still  in  a stage  of 
early  development  and  problems  remain. 

Proof  of  correctness  is  most  amenable  to  application  to  logical 
properties and algorithms. Other properties (e.g. round off error) are
more  difficult  to  handle  and  many  have  not  been  adequately  addressed  as  yet. 
Consequently,  it  is  possible  to  prove  a program  "correct"  but  then  find  the 
program  is  not  perfectly  reliable  in  actual  use.  Misinterpretation  of 
intended  use  cannot  be  demonstrated  by  proof  of  correctness  techniques 
either [75]. Therefore more practical means of evaluating the reliability
of  large  scale  software  are  required. 

While proof of correctness is not an adequate solution to the software
reliability question we are pursuing, it has far more promise than the total
testing  mentioned  earlier,  and  is  an  area  where  further  research  may  prove 
beneficial  in  the  future. 

1.3  SEEDING  AND  TAGGING 

An  important  aspect  of  software  reliability  is  the  determination  of 
the  total  number  of  errors  in  a given  program.  Such  a number,  coupled  with 
the number of errors discovered and removed to date, would give a good
means  of  interpreting  the  present  status  of  the  reliability  of  the  software. 
The  user  and/or  developer  could  make  a more  informed  decision  as  to  whether 
the  software  is  acceptable  enough  or  whether  the  debugging  should  continue 
and  how  much  time  and  effort  needs  to  be  expended. 

Two  possible  methods  of  determining  the  total  number  of  software 
errors  in  a program  are  "seeding"  and  "tagging".  The  peculiar  names  come 
from  the  standard  wildlife-conservation  practice  of  estimating  the  size  of 
animal  populations  thru  "tagging"  a captured  sample  of  animals  from  the 
population,  releasing  ("seeding")  the  sample  back  into  the  population, 
capturing a second sample, and counting the number of tagged animals.
Assuming the second sample is representative of the entire population, it
is  possible  to  estimate  the  total  population  by  comparing  ratios.  Feller 
discusses  the  technique  applied  to  animal  populations. 

With  respect  to  software  there  have  been  two  variations  proposed  to 
date.  The  first  is  due  to  Mills  and  Hyman  [138]  and  they  suggest  "seeding", 
i.e.  deliberately  inserting  a certain  number  of  errors  into  a program  before 
debugging  is  to  begin.  The  person  doing  the  seeding  should  obviously  not 
do  the  debugging  as  well.  A number  of  errors  will  be  discovered  during 
debugging,  some  of  which  will  presumably  be  "seeded"  errors.  Thus,  an 
estimate of the total number of original errors in the program can be
obtained. The problem with this procedure lies in the difficulty of inventing
and placing errors that are similar enough to the errors in the program so
that  the  inserted  errors  are  as  likely  to  be  discovered  as  the  original 
errors.  Without  knowing  anything  about  the  errors  in  the  program,  this 
would  be  a difficult  job  and  would  introduce  more  variability  into  the  final 
results.  For  a large  software  system  the  number  of  "seeded"  errors  would 
have  to  be  quite  large  also,  and  placement  of  the  errors  becomes  more 
problematic  as  well. 

Rudner  [119,120]  has  done  the  most  work  in  this  area  and  has  made  the 
most  progress.  She  proposes  a second  variation  of  the  technique,  and  her 
work appears to have the most promise. This method, called "tagging", involves
two debuggers working independently on a piece of software with an
unknown  total  number  of  errors,  N.  The  first  debugger  discovers  t errors 
which  can  be  considered  the  "tagged"  bugs.  The  second  debugger  discovers 
s errors,  which  can  be  considered  the  sample.  The  two  sets  of  discovered 
errors  have  a certain  number  of  errors,  c,  in  common.  The  ratio  of  c to  s 
should be approximately equal to the ratio t to N, assuming that all errors
have an equal chance of being discovered and no new errors are introduced
during debugging. Thus, an estimate of N can be determined: $\hat{N} = st/c$.

A statistical, formal approach would be to consider c as a discrete
random variable with a hypergeometric distribution:

$$p(c \mid s, t, N) = \frac{\binom{t}{c}\binom{N-t}{s-c}}{\binom{N}{s}}$$

The mean of this distribution is $st/N$, and it follows that the maximum
likelihood estimator of N, call it $N_0$, will be the largest integer less than
$st/c$. Note that $N_0$ is approximately the ratio estimate $st/c$ obtained above.
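
The two-debugger estimate can be checked numerically. The sketch below evaluates the hypergeometric likelihood for a range of candidate values of N and picks the maximizer by brute force; it assumes c > 0 and uses hypothetical counts. The function names and the search limit are illustrative choices, not part of Rudner's method.

```python
from fractions import Fraction
from math import comb


def ratio_estimate(t, s, c):
    """Ratio estimate st/c of the total number of errors (requires c > 0)."""
    return Fraction(s * t, c)


def likelihood(c, s, t, n):
    """Hypergeometric probability of c common errors given a total of n errors."""
    if n < t + s - c:  # fewer errors than the two debuggers found between them
        return Fraction(0)
    return Fraction(comb(t, c) * comb(n - t, s - c), comb(n, s))


def mle_total(t, s, c, search_limit=1000):
    """Brute-force maximum likelihood estimate of N (search_limit must exceed st/c)."""
    candidates = range(t + s - c, search_limit)
    return max(candidates, key=lambda n: likelihood(c, s, t, n))


# Hypothetical counts: debugger 1 finds t = 21 errors, debugger 2 finds s = 16,
# and c = 5 errors appear in both lists.
n_ratio = ratio_estimate(21, 16, 5)
n_mle = mle_total(21, 16, 5)
```

Exact rational arithmetic is used so that neighbouring likelihood values are compared without rounding error. When st/c happens to be an integer, the likelihood takes the same value at st/c and at st/c - 1, so the maximizer is not unique in that case.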

Rudner also develops a method whereby more than two debuggers can
be utilized to obtain another estimate, $N_{od}$, with a smaller variance than
$N_0$. Confidence limits can be determined for $N_0$ and $N_{od}$, the latter being
narrower than the former as expected. Still another estimate is proposed
which has not only a smaller variance but also a lower bias.

The two assumptions mentioned earlier (equal probability of discovery
of each error and the non-introduction of new errors) may possibly be
unrealistically restrictive. The second assumption has not been eliminated
from the model as yet. However, Rudner has developed a model modification
for possible replacement of the first assumption. The modification basically
divides  errors  into  as  many  different  categories  of  discovery  difficulty  as 
required  and  proceeds  to  estimate  the  number  of  errors  in  each  category 
rather  than  just  the  aggregate  error  total  N. 

The  techniques  and  estimators  developed  by  Rudner  appear  to  have 
enough promise and appeal to warrant further investigation and research.
The various estimations of N have a much sounder basis than many of the
other  models  we  have  studied.  Furthermore,  the  extra  capability  of 
determining  confidence  limits  makes  the  technique  even  more  attractive. 
Application  of  the  techniques  to  an  actual  debugging  situation  should  prove 
most  interesting  from  an  evaluation  and  validation  standpoint.  It  may  be 
necessary,  however,  to  modify  the  model  to  eliminate  the  assumption  that 
no new errors are introduced during the removal of a discovered error.
Another important consideration in the application of the techniques will be
the  increased  cost  of  having  two  or  more  debuggers  working  independently.  With 
only  one  test  procedure,  the  cost  of  testing  is  already  a major  expense  in  the 
development of a piece of software, and it will be necessary to determine if
the "tagging" techniques will be cost effective in the determination of
software reliability.

1.4 STRUCTURAL MODELS

Most  work  on  probabilistic  modeling  of  software  reliability  has 
treated  the  program  as  a black  box,  and  attempted  to  model  its  outward 
behavior.  That  work  is  described  under  Analytical  Models,  Section  1.6, 
following.  This  section  discusses  several  efforts  that  have  been  made  to 
look  inside  the  box  and  incorporate  knowledge  of  the  structure  of  a program. 
It  should  be  pointed  out  that  all  the  structural  models  give  an  estimate  of 
reliability  for  a particular  software  element  at  a specific  time.  They 
do  not  model  the  reliability  improvements  that  occur  over  time.  The  general 
function  of  these  models  is  to  combine  existing  estimates  of  sub-routine  or 
module  reliability  with  structural  information  to  devise  an  estimate  of  the 
reliability  of  the  entire  program. 

1.4.1 Shooman Structural Model

In [112] Dr. M.L. Shooman describes a path oriented structural model.
He assumes first that a program can be decomposed into a number of paths or
cases. Which path or paths are executed is dependent on the data input to
the program. It is also assumed that the identification of paths will be
done at a high enough level to yield a relatively small number of cases,
small at least with respect to the at least 10[exponent illegible] unique paths that can be
identified in most programs. For each path, the model requires that three
parameters be estimated. The paths are identified by N tests that between
them will cause all paths to be executed. The parameters are $t_i$, the time
required to run case i; $q_i$, the probability of an error on a run of case i;
and $f_i$, the relative frequency with which the path represented by case i
is executed. Using these parameters Shooman derives an expression for the
system failure rate:

$$\lambda = \frac{\sum_{i=1}^{N} f_i q_i}{\sum_{i=1}^{N} f_i t_i}$$

This is the average probability of failure on one test case, over the
mean run time for a single case.
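
A small numerical sketch may help fix the meaning of the three parameters. It reads the verbal description above literally, taking the failure rate as the frequency-weighted failure probability divided by the frequency-weighted run time; that reading, the function name, and the sample values are assumptions made for illustration.

```python
def path_failure_rate(cases):
    """Failure rate from per-case parameters.

    cases: list of (f_i, q_i, t_i) tuples, where f_i is the relative frequency
    of case i, q_i its failure probability per run, and t_i its run time.
    """
    expected_failures = sum(f * q for f, q, t in cases)  # average failure probability per run
    expected_run_time = sum(f * t for f, q, t in cases)  # mean run time per run
    return expected_failures / expected_run_time


# Hypothetical program with three cases; run times in seconds.
cases = [
    (0.7, 0.001, 2.0),   # common, reliable, fast
    (0.2, 0.010, 5.0),
    (0.1, 0.050, 10.0),  # rare, error-prone, slow
]
rate = path_failure_rate(cases)  # failures per second of execution
```

In this example the rarely executed case contributes most of the numerator, which shows how sensitive the result is to the hard-to-estimate $q_i$ values.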

In his paper, Dr. Shooman admits the massive difficulties in estimating
the parameters. With extensive special code in the program, actual runs can
capture data for relative frequency and run times. Alternatively he suggests
assuming uniform usage of paths and use of an average run time. He makes no
useful suggestions regarding $q_i$, the probability of failure for each path.

Halstead [49], in an approach known as software physics, attempts to relate
various aspects of computer software
to  E,  the  number  of  elementary  mental  discriminations  needed  to  produce  the 
program.  E is  considered  an  accurate  measure  of  the  effort  expended.  He 
computes  E from  such  factors  as  operator  and  operand  vocabulary  and  program 
length.  Funami  [41]  recently  applied  software  physics  to  debugging  data 
reported  by  Akiyama  [1].  Funami  reports  an  excellent  correlation  between  E 
and  the  total  number  of  bugs  present  for  the  nine  modules  described  by  Akiyama. 
If  a precise  relationship  can  be  discovered,  software  physics  may  allow  the 
direct estimation of Shooman's parameter $q_i$, the path error probability. The
availability of such a technique is many years away, however, and methods will
need to be developed to account for variations in language and between
programmers.
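
For orientation, the sketch below computes an effort figure from operator and operand counts using the form in which Halstead's measures are commonly stated (volume times difficulty). Whether [49] uses exactly this form is an assumption, and the function name and sample counts are hypothetical.

```python
from math import log2


def halstead_effort(distinct_operators, distinct_operands,
                    total_operators, total_operands):
    """Effort E as volume times difficulty (commonly cited form, assumed here)."""
    vocabulary = distinct_operators + distinct_operands
    length = total_operators + total_operands
    volume = length * log2(vocabulary)
    difficulty = (distinct_operators / 2) * (total_operands / distinct_operands)
    return volume * difficulty


# Hypothetical module: 2 distinct operators, 2 distinct operands,
# each kind used 4 times in total.
effort = halstead_effort(2, 2, 4, 4)
```

All four inputs can be counted mechanically from source text, which is what makes a relationship between E and error counts attractive as a possible route to estimating error probabilities.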

1.4.2 Littlewood Markov Model

Littlewood [69] proposes a model that more accurately reproduces
program structure. Each section of coding in a particular program is
identified as a state. The execution of the program can then be considered
a series of transfers from state to state. If the probability of each transfer
occurring is independent of how long the system has been running, the system
can be simulated by a Markov process. The parameters for this are N states
and $N^2$ probabilities of the form $q_{ij}$. These are the probability that when in
state i the system will transfer to state j. In order to include those
errors that occur in interfaces Littlewood introduces $\lambda_{ij}$, the probability
of a failure occurring during the transfer from state i to state j. The
probability of an error occurring while executing in state i is $\nu_i$. In
his paper, Littlewood outlines techniques for determining the probability
distribution for time to failure as long as the $\lambda_{ij}$'s and $\nu_i$'s are very
small.

In a later paper Littlewood [70] introduces several additional
parameters. $F_{ik}$ is a probability function that determines, before a transfer
from state i to state k, how long the program will remain in state i. This
changes the transition process from Markov to semi-Markov. The other
additional parameter is $y_i$, the cost of a failure in state i. Littlewood
then suggests methods for calculating the cost of running the program and
the overall failure probability.
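
The basic Markov form of the model can be illustrated with a simplified discrete-step calculation. The sketch below is not Littlewood's derivation: it ignores how long the program stays in each state and simply computes the probability of completing a given number of execute-then-transfer steps without failure. All names and sample values are hypothetical.

```python
def survival_probability(q, transfer_fail, state_fail, start, steps):
    """Probability of surviving `steps` execute-and-transfer steps from `start`.

    q[i][j]             probability of transferring from state i to state j
    transfer_fail[i][j] probability of failure during that transfer
    state_fail[i]       probability of failure while executing state i
    """
    n = len(q)
    alive = [1.0] * n  # surviving zero further steps is certain
    for _ in range(steps):
        alive = [
            (1.0 - state_fail[i])
            * sum(q[i][j] * (1.0 - transfer_fail[i][j]) * alive[j] for j in range(n))
            for i in range(n)
        ]
    return alive[start]


# Hypothetical two-module program that alternates between its modules.
q = [[0.0, 1.0], [1.0, 0.0]]
transfer_fail = [[0.0, 0.01], [0.01, 0.0]]
state_fail = [0.001, 0.002]
p_ok = survival_probability(q, transfer_fail, state_fail, start=0, steps=100)
```

Even this reduced form needs a full matrix of transfer probabilities and a second matrix of transfer failure probabilities, which hints at the parameter estimation burden discussed next.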

As  with  Shooman's  structural  model,  the  level  of  detail  depicted  and 
consequently  the  number  of  states  must  be  selected  so  as  to  limit  the  amount 
of computation required. If a system with 25 modules or sub-routines were
being modeled at a module level, more than 1300 numeric parameters would be
required. Again, the structural parameters, the $q_{ij}$'s and $F_{ik}$'s, could be
estimated by adding many counters and other special code to a program. The
effect of so many changes on the program's reliability and timing is questionable,
however. As with Shooman's model, Funami's work may in time allow
estimation of the $\nu_i$'s, the probability of failure in each state. Because


the  states  can  be  identified  with  modules  or  subroutines,  unlike  in  Shooman's 
model,  it  may  be  possible  to  use  data  from  early  in  development  when  the 
modules are tested separately. It must be remembered, though, that this
model does not project any growth in reliability with time. Both Littlewood
models  are  extremely  difficult  to  solve  numerically.  For  these  reasons  we 
feel  further  study  would  not  be  productive  at  this  time. 

1.4.3  Green  Model 

Thomas  Green  has  developed  a model  [46,47]  of  program  debugging  which 
can  be  used  to  estimate  the  effect  of  program  structure  on  debugging.  The 
program is described as a set of nodes, arcs, and loops (Figure 4). Parameters
describe the probability that when the program reaches a particular
node each arc will be taken, and how many times each loop will be executed.
The nodes are branch points and the arcs are sections of coding. A computer
simulation  then  can  randomly  distribute  errors  among  the  arcs  and  simulate 
execution  with  various  inputs.  This  can  reveal  the  effect  of  program 
structure  on  the  effectiveness  of  a sequence  of  randomly  selected  tests. 

Green's  results  to  date  have  confirmed  several  intuitive  assumptions 
about  the  difficulty  of  testing  all  parts  of  a program  and  the  tendency  of 
errors  to  "hide"  in  little-used  sections  of  a program. 
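
A toy version of such a simulation shows the "hiding" effect. The sketch below is a deliberately reduced assumption, not Green's model: it has a single branch node, no loops, and it supposes that one test exposes every error on the arc it executes. The function name and values are hypothetical.

```python
import random


def simulate_debugging(arc_probabilities, error_count, test_count, seed=0):
    """Return the number of undiscovered errors left on each arc after testing."""
    rng = random.Random(seed)
    arcs = range(len(arc_probabilities))

    # Distribute the errors at random among the arcs.
    errors_on_arc = [0] * len(arc_probabilities)
    for _ in range(error_count):
        errors_on_arc[rng.choice(arcs)] += 1

    # Each randomly selected test follows one arc, chosen by its probability.
    for _ in range(test_count):
        taken = rng.choices(arcs, weights=arc_probabilities)[0]
        errors_on_arc[taken] = 0  # assumed: the test exposes all errors on that arc
    return errors_on_arc


# Hypothetical branch: one arc is taken 99% of the time, two arcs 0.5% each.
remaining = simulate_debugging([0.99, 0.005, 0.005], error_count=30, test_count=50)
```

With usage this skewed, the heavily used arc is cleaned almost immediately, while errors on the rarely taken arcs can survive many randomly selected tests.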

1.5  EMPIRICAL  MODELS 

In  the  following  sections,  we  present  a number  of  approaches  to 
software  reliability  which  do  not  attempt  to  model  the  actual  error 
detection/correction/generation process. These models do not hypothesize
a particular  hazard  function  or  a particular  reliability  function,  but  attempt, 
instead,  to  either  select  a particular  function  based  upon  empirical  data  or 
relate  reliability  statistically  to  certain  software  characteristics.  These 
so-called  "empirical  models"  approach  the  problem  of  software  reliability 
from a statistical viewpoint, rather than a mathematical or analytical viewpoint.

FIG. 4 - DIRECTED GRAPH REPRESENTATION OF A PROGRAM

1.5.1 Schneidewind Methodology

A major  objective  of  statistical  or  empirical  software  models  is  to 
predict  software  reliability  based  on  test  program  data.  Specifically, 
Schneidewind  [108]  investigates  methods  for  estimating  the  amount  of  test 
time required to satisfy program reliability requirements. Then, a reliability
function can be obtained if a theoretical density function f(t)
can be fitted to the empirical data of time between errors. Furthermore,
confidence limits may be derived for the theoretical reliability function
parameters.

Schneidewind  also  has  proposed  the  utilization  of  statistical  testing 
techniques.  First,  analysis-of-variance  (ANOVA)  may  be  used  in  testing  for 
differences  between  and  among  programs.  Hence,  the  results  may  be  used  to 
identify  the  major  contributors  of  program  reliability  variability.  Second, 
goodness-of-fit tests can be implemented to specify the type of distribution
which  is  applicable  to  software  failures.  The  general  approach  proposed  by 
Schneidewind  is  summarized  below: 

(1)  Tentatively  select  a reliability  function  based  on  the  shape  of 
the  frequency  function  of  the  empirical  data, 

(2)  Estimate  the  parameters  of  the  reliability  function, 

(3) Use goodness-of-fit tests to identify the reliability function,

(4) Estimate the confidence limits of the reliability function parameters,

(5) Estimate the confidence limits of the reliability function,

(6)  Predict  the  reliability  and  its  confidence  limit, 

(7)  Compare  the  predicted  reliability  with  the  required  reliability. 

This  appears  to  be  a very  reasonable  approach  and  could  perhaps  be 
used  in  conjunction  with  one  of  the  analytical  models.  It  is  intuitively 
appealing  because  of  its  logical  approach  and  its  applicability  to  a very 
general  class  of  reliability  models. 
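
The seven steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Schneidewind's own procedure: the exponential distribution is assumed as the tentatively selected function, the Kolmogorov-Smirnov distance stands in for the goodness-of-fit test, the chi-square quantile uses the Wilson-Hilferty approximation, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def fit_exponential(times_between_errors):
    """Step 2: maximum-likelihood estimate of the MTBF (the sample mean)."""
    return sum(times_between_errors) / len(times_between_errors)


def ks_statistic(times_between_errors, mtbf):
    """Step 3: largest gap between the empirical CDF and the fitted
    exponential CDF F(t) = 1 - exp(-t / mtbf)."""
    xs = sorted(times_between_errors)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-x / mtbf)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile (assumed
    accurate enough for a sketch; a statistics library would be exact)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


def mtbf_lower_limit(times_between_errors, confidence=0.90):
    """Step 4: one-sided lower limit 2T / chi2(confidence, 2n), assuming
    the test ended at the n-th failure."""
    n = len(times_between_errors)
    total = sum(times_between_errors)
    return 2.0 * total / chi2_quantile(confidence, 2 * n)


def reliability(t, mtbf):
    """Exponential reliability function R(t) = exp(-t / mtbf)."""
    return math.exp(-t / mtbf)


def meets_requirement(times_between_errors, mission_time, required,
                      confidence=0.90):
    """Steps 5-7: compare the lower reliability limit with the requirement."""
    lower = mtbf_lower_limit(times_between_errors, confidence)
    return reliability(mission_time, lower) >= required
```

The KS distance is only computed here, not judged; because the MTBF is estimated from the same data, standard KS critical values would be optimistic, so the acceptance threshold is left to the analyst.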

However,  implementing  the  aforementioned  approach  may  be  complicated 
by  the  fact  that  the  time  between  errors  or  the  number  of  errors  per  interval 
is  not  a stationary  process  with  respect  to  test  time.  As  a result  of  the 
reduction  in  the  error  rate,  the  form  of  the  distribution  may  change,  or  it 
may  remain  the  same  with  its  parameters  changing.  Nevertheless,  in  either 
case,  this  indicates  that  a reliability  function  which  is  based  on  the  total 
set  of  data  points  may  not  be  an  accurate  predictor  because  the  entire  set 
of  data  is  not  representative  of  the  current  error  process. 

If  the  distribution  remains  the  same  during  the  testing  process  and 
the  parameters  change,  Schneidewind  suggests  using  a smoothing  technique 
with  the  most  recent  data  points  to  derive  new  parameter  estimates  relative 
to  the  upcoming  time  interval.  Thus,  parameter  estimates  would  continue  to 
be  updated  as  the  testing  continues.  However,  if  the  distribution  does 
change,  a more  serious  problem  arises,  that  being  the  identification  of  the 
distribution  which  is  relevant  for  each  stage  of  testing.  Once  a suitable 
reliability  function  is  obtained  it  becomes  possible  to  estimate  the  lower 
confidence  limit  for  the  MTBF  and  thus  use  this  estimate  with  the  reliability 
function  to  obtain  the  reliability  lower  limit.  Finally,  the  estimate  may 
be  compared  with  the  specified  requirements  to  determine  whether  an  acceptable 
reliability  has  been  achieved. 
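
The source does not say which smoothing technique Schneidewind has in mind. As one possible reading, the sketch below uses simple exponential smoothing of the times between errors, so that recent observations dominate the MTBF estimate for the upcoming interval. The function name and the default weight of 0.3 are assumptions made only for illustration.

```python
def smoothed_mtbf(times_between_errors, weight=0.3):
    """Exponentially smoothed MTBF estimate.

    Each new observation replaces a fraction `weight` of the running
    estimate, so older test data fades out as testing continues.
    """
    estimate = times_between_errors[0]
    for t in times_between_errors[1:]:
        estimate = weight * t + (1.0 - weight) * estimate
    return estimate
```

When the times between errors are growing, this estimate tracks the recent, longer intervals more closely than the mean of the whole data set, which is the behavior the non-stationarity argument calls for.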

As  was  previously  mentioned,  the  reliability  function  of  one  stage  may 
not  apply  at  a future  time.  It  is  then  assumed  that  the  revised  reliability 
function  is  applicable  at  the  next  stage.  A change  in  the  type  of  reliability 
function  is  made  at  the  completion  of  a test  stage,  if  a noticeable  change 
occurs  in  the  distribution  of  time  between  errors. 

Schneidewind  concludes  his  discussion  by  noting  the  following  points: 
(1)  the  specifics  of  the  approach  will  probably  be  supplemented  by  an 
improved model which is now under development, (2) reliability is estimated
to satisfy program reliability requirements, and (3) important factors
affecting  software  reliability  and  its  accuracy  include  the  difference  of 
reliability  characteristics  among  programs  and  the  previously  discussed 
non-stationary  process  of  error  occurrence. 

1.5.2  Weibull  Model  (Proposed  by  Coutinho) 

A general  hardware  oriented  reliability  growth  technique  to  predict 
future  system  failures  is  to  utilize  past  data  and  fit  parameters  of  the 
Weibull  distribution  as  suggested  in  [22],  [123],  and  [124].  This  approach 
models  the  software  error  rate  as  an  explicit  function  of  time  and  not  as 
a function  of  the  number  of  errors  remaining.  Coutinho  notes  "when  errors, 
failures  or  stoppages  occur,  the  causes  must  be  found,  corrective  action 
taken,  and  the  test  re-run  to  demonstrate  that  the  corrective  action  is 
effective". Specifically, the instantaneous failure rate at time t is
given by the two-parameter Weibull hazard function:

Z(t) = (\eta / \sigma) (t / \sigma)^{\eta - 1}    (Eq. 1)

where \eta = a shape parameter    (Def. 1)

      \sigma = a scale parameter    (Def. 2)

Coutinho, in [22], further states "the error rate declines rapidly
during the debugging period until it levels off and reaches a residual
error rate which continues to exist in large, complex debugged programs".
The MTTF and reliability equations are as follows:

MTTF = [\sigma \Gamma(1 / \eta)] / \eta    (Eq. 2)

R(t) = \exp[-(t / \sigma)^{\eta}]    (Eq. 3)

where \Gamma is the gamma function.
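
Equations 1 through 3 translate directly into code. The following is a minimal sketch using only the Python standard library; the function and argument names (`shape` for the shape parameter, `scale` for the scale parameter) are chosen here for illustration.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate Z(t) of Eq. 1; requires t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_reliability(t, shape, scale):
    """Reliability R(t) of Eq. 3."""
    return math.exp(-((t / scale) ** shape))


def weibull_mttf(shape, scale):
    """Mean time to failure of Eq. 2: scale * Gamma(1 / shape) / shape."""
    return scale * math.gamma(1.0 / shape) / shape
```

With a shape parameter of 1 these reduce to the exponential case (constant hazard 1/scale and MTTF equal to the scale), while a shape parameter below 1 gives the declining error rate that Coutinho describes.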


The  following  inputs  are  necessary  to  utilize  this  model. 


(1)  The  number  of  errors  in  each  time  interval. 

(2)  The  time  between  each  time  interval  (CPU  Time). 

(3)  The  total  number  of  time  intervals. 

(4)  The  cumulative  number  of  errors  found  to  date. 

Estimators  of  the  shape  and  scale  parameters  may  be  derived  using 
plotting  techniques,  the  method  of  moments  and  maximum  likelihood. 

This  approach  seems  reasonable  due  to  the  wide  range  of  possible 
shapes that the Weibull distribution can assume. Also, the Weibull distribution
does reflect our assumption of reliability growth and is probably
the  most  relevant  distribution  to  use. 

This technique was applied to three different data sets to
determine the usefulness and accuracy of such an approach. First, Coutinho
[22] applied the Weibull technique to the "army field test data", and his
results (graphs) are questionable. The data set under consideration
was 36 weeks in duration; however, his results appear to be based solely on
the last 19 weeks (weeks 18-36). He seems to have discarded the first 17 weeks
from his analysis, yet he presents no supporting evidence as to why
this was done. The line "fitted" to his data cannot be
derived unless only the last half of the data is used.

A second  question  pertaining  to  his  analysis  also  arises.  The  data 
points Coutinho plotted on his graph (reference [22], Figure 1) were not
the number of errors discovered per week but the cumulative average number
of errors per week. Thus, the initial points plotted will be weighted more
heavily  than  later  occurring  ones.  Naturally,  as  the  sample  size  increases, 
the  error  data  (actual  number  of  errors  discovered  per  week)  will  have  less 
impact  on  the  cumulative  average  and  any  fluctuations  will  be  minimal.  Finally, 
when  the  error  data  is  plotted  on  log-log  paper  (3x3  cycles)  any  deviations  will 
be  extremely  difficult  to  observe. 
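
The damping effect of cumulative averaging can be seen with a short sketch. The function name is hypothetical, and any weekly counts fed to it here are invented purely to illustrate the argument; they are not taken from any of the data sets discussed.

```python
def cumulative_average(errors_per_week):
    """Return the running average of errors per week.

    Entry k is the total number of errors found through week k divided
    by k, which is the quantity plotted in the cumulative-average method.
    """
    averages = []
    total = 0
    for week, count in enumerate(errors_per_week, start=1):
        total += count
        averages.append(total / week)
    return averages
```

Even for a series that swings between very low and very high weekly counts, the later cumulative averages move only slightly, so a straight line can appear to fit data whose underlying error counts remain erratic.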

After observing Coutinho's results, we applied this method to two
other  sets  of  data.  First,  Sukert's  data  [124]  was  analyzed  and  grouped 
into  weekly  intervals  prior  to  application  of  this  model.  On  the  log-log 
paper,  the  points  plotted  are  reasonably  similar  to  Coutinho's  in  that  the 
initial  points  exhibit  erratic  behavior  and  only  upon  reaching  the  12th 
week of the test period does a trend begin to appear. Thus, we would be
forced to throw out the first 11 weeks of error data, which we cannot justify.
However,  we  continued  and  attempted  to  fit  a line  to  the  data  points  (both 
actual  number  of  errors  per  week  and  the  cumulative  average  number  of  errors 
per  week)  using  simple  linear  regression.  Again,  in  this  application,  the 
results  were  not  encouraging. 

The  other  set  of  data  that  we  applied  this  Weibull  method  to  consisted 
of  only  seven  data  points  [30].  In  [22]  Coutinho  states  that  you  need  a 
sufficient  number  of  data  points  before  a meaningful  analysis  can  be  prepared. 
Due to the small data set it was difficult to clearly identify a trend. However,
the regression line derived was a much better fit than those obtained
in  the  two  previous  applications  of  this  method. 

Coutinho comments on his application: "the initial portions of the
ΣD/t (cumulative average number of errors detected per week) curve for both
samples  are  similar  and  do  not  plot  on  a straight  line  as  expected,  but  exhibit 
a wave  form.  This  is  caused  by  peculiarities  of  the  test  plan  which  results 
in  deficiencies  being  initially  discovered  in  an  irregular  order".  The 
above  statement  may  have  some  validity  to  it;  however,  we  must  also  consider 
the  fact  that  due  to  the  cumulative  averaging  technique,  early  data  points 
are  weighted  more  heavily  than  later  ones.  Thus,  it  would  appear  as  though 
even  with  the  most  erratic  error  data  one  could  eventually  project  a "fitted 
line"  as  the  sample  size  becomes  sufficiently  large.  Due  to  this  apparent 
deficiency, we do question the use of cumulative averaging in software reliability
studies. Furthermore, Coutinho fails to mention why only weeks 18-36
were used in his analysis. In light of all the problems and questionable
procedures  observed,  we  feel  that  such  a technique  will  not  serve  as  an 
appropriate  predictor  in  analyzing  the  software  portion  of  an  automatic 
digital  flight  control  system. 

1.5.3 Corcoran-Weingarten-Zehna Model

A model  developed  in  [21],  more  applicable  to  hardware,  focuses  on 
the observed performance of an item and hence estimates the current reliability.
The approach presented differs considerably from the models
developed  exclusively  to  predict  software  reliability. 

Initially  it  should  be  noted  that  this  particular  model  ignores  the 
time  of  the  test  and  only  considers  the  outcome  of  N trials.  This  appears  to 
be  a limiting  factor  on  the  applicability  of  the  proposed  model;  however, 
further  investigation  is  in  order.  Below  are  some  of  the  symbolic  definitions 
given  by  Corcoran,  Weingarten  and  Zehna: 

N = the total number of tests    (Def. 1)

N_0 = the number of observed successes in N tests    (Def. 2)

Hence, the current reliability after N tests may be given as N_0/N.

N - N_0 = the number of failures in N tests    (Def. 3)

Now, a primary objective is to identify some of these failures, take
corrective action, and calculate the new reliability. It was proposed that
after corrective action had been taken N' additional tests be conducted
with N'_0 being the number of additional successes noted. Finally, a new
reliability estimate may be computed by N'_0/N'.

A listing  and  evaluation  of  the  model  assumptions  is  presented  below. 

(1) It is assumed that a given test may result in success, with an unknown
probability p_0, or in one of K failure modes, with the unknown probability
of failure type i given as q_i.

(2) The N tests to be performed are assumed to be independent.

(3) By fixing K, it is assumed that no additional failure modes are
generated when corrective action is taken.

(4) No corrective action is taken until all N tests have been executed.

(5) It is assumed that when a failure occurs it can be categorized as to
type.

(6) If a type i failure is observed, it is assumed that there is a known
probability a_i of removing that failure by corrective action.

(7) a_1, a_2, ..., a_K are conditional probabilities.

The  first  assumption  seems  reasonable  and  implies  the  following: 

p_0 + \sum_{i=1}^{K} q_i = 1.0    (Eq. 1)


Assumption  3 is  the  first  one  which  seems  questionable  in  that  there  is  a 
distinct  possibility  that  the  corrective  action  taken  may  in  fact  introduce 
additional  failures.  The  sixth  assumption  also  causes  concern  since  a 
known  probability  of  correction  is  assumed.  Realistically,  it  would  seem 
extremely  difficult  to  accurately  quantify  the  probability  of  successful 
correction.  Finally,  assumption  7 presents  no  problems  statistically  and 
may  be  interpreted  as: 

P(correction | failure of type i) = a_i    (Eq. 2)


Since  failures  are  defined  by  type,  the  following  should  prove  helpful 
throughout  the  remainder  of  this  section. 


N_i = the number of observed failures of type i    (Def. 4)

N_0 + \sum_{i=1}^{K} N_i = N    (Eq. 3)


Current reliability may now be computed as:

\rho = p_0 + \sum_{i=1}^{K} y_i q_i    (Eq. 4)

where y_i = 0    if N_i = 0
      y_i = a_i  if N_i > 0

Then, the mean reliability is:

E(\rho) = p_0 + \sum_{i=1}^{K} a_i q_i [1 - (1 - q_i)^N]    (Eq. 5)

Then, using the estimates of the parameters in equation 4:

\hat{\rho}_1 = N_0 / N + \sum_{i=1}^{K} a_i N_i / N    (Eq. 6)

Other estimators similar to equation 6 are presented in [21] and are
not reproduced here.

Finally, Corcoran et al. state "with regard to our model, we feel
that we must stress the obvious and remark that our estimators are not
better than the probabilities (a_i) of successful corrective action which
are assumed". Furthermore, they mention that their proposed model does have
limited applicability. Although this model does predict reliability,
it does not estimate the number of errors or times between errors. Thus,
in light of these limitations and the fact that test time is totally ignored,
we believe that such a model will not address the overall software reliability
problem adequately.
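
As a small illustration of the estimator in equation 6, the sketch below computes the estimated reliability after corrective action from the test counts and the assumed correction probabilities. The function and argument names are hypothetical, and the result is only as good as the assumed correction probabilities, which is exactly the weakness noted above.

```python
def reliability_after_correction(successes, failures_by_type, fix_probabilities):
    """Estimate reliability after corrective action.

    successes          -- number of successful tests
    failures_by_type   -- observed failure counts, one per failure mode
    fix_probabilities  -- assumed probability of correcting each mode
    """
    n = successes + sum(failures_by_type)
    corrected = sum(a * n_i
                    for a, n_i in zip(fix_probabilities, failures_by_type))
    return (successes + corrected) / n
```

Setting every correction probability to zero recovers the uncorrected estimate, the fraction of successful tests.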


1.5.4  Weiss  Model 

The  Weiss  model  was  developed  to  predict  a mean  time  between  failure 
(MTBF)  curve,  based  on  a given  number  of  trials,  and  compare  it  with  the 
actual  MTBF's.  The  following  assumptions  are  given  in  [126]. 

(1)  The  exponential  distribution  is  assumed  for  the  times  to  failure. 

(2)  It  is  assumed  that  there  are  M possible  sources  of  failure. 

(3)  The  failure  rate  is  different  for  each  source  of  failure. 

(4) When a failure occurs, there is a probability (P_c < 1) that the failure
is corrected.

(5) It is assumed that MTBF for the system changes by a constant percent
each trial.

(6) For each of the M sources of failure, this type of error may occur
several times in the software.

(7) For some error types, once the error is discovered, there is a
high probability that all other occurrences of this type will be
removed at once.

(8) For other error types, a relatively small fraction of their occurrences
elsewhere in the software will be detected and corrected at
once.

The first three assumptions seem fairly sound and do not pose any
problems. However, assumption 4 requires that each P_c be estimated from
previous data. It would appear that subjective judgment would play a
major role in establishing the various values. Quantifying the P_c's is
a formidable task and is similar to the process we question in the model of
Corcoran et al.

Weiss states that a standard test program is developed, and the system
is tested until it fails (time to failure = T_m) or until some maximum time
(T_t) is reached, when the test is terminated. After the completion of
testing, design changes are made and incorporated into the system, and the
testing process resumes. Then after n systems have been tested, n_c have
been terminated by failure (T_m), and n_t are terminated when a specified
operating time (T_t) is reached. Finally, it has been assumed that MTBF,
[T(i)], and its reciprocal a(i) are functions of the number of the trial,
i, and certain parameters that depend on the reliability growth process
assumed. Symbolically, this may be interpreted as:

a(i) = T(i)^{-1}

Now, the density function for the sample resulting from the test of n
systems is:

\phi = [\prod_{n_c} a(i)] \Delta^{-n_t} \exp[-\sum_{n} a(i) T_m(i)]    (Eq. 3)

\ln \phi = \sum_{n_c} \ln a(i) - \sum_{n} a(i) T_m(i) - n_t \ln \Delta    (Eq. 4)

where \Delta = a very small increment of time    (Def. 1)

If a_1, a_2, ..., a_m are determined via maximum likelihood, the set of
equations \partial \ln \phi / \partial a_j = 0 is obtained:

\sum_{n_c} \partial \ln a(i) / \partial a_j - \sum_{n} T_m(i) \partial a(i) / \partial a_j = 0    (Eq. 5)

The above equation may be applied to obtain the M.L.E. of a_1, a_2, ..., a_m.

Weiss then presents the case where it is assumed that the probability
that the system will fail in \Delta t is a \Delta t with a being a constant. Then the
probability of no failure in t is:

\phi(t) = (1 - a \Delta t)^{t / \Delta t}    (Eq. 6)

As \Delta t approaches zero, equation 6 goes to e^{-at}. Thus, MTBF is:

T = \int_0^{\infty} t a e^{-at} dt = a^{-1}    (Eq. 7)

Hence, if there are M sources of failure, then the general form of
equations 6 and 7 becomes:

\phi(t) = \exp[-\sum_{M} a_j t]    (Eq. 8)

T = (\sum_{M} a_j)^{-1}    (Eq. 9)

In the application made by Weiss "unexpectedly close agreement
between the derived curve and the original can be reasonably explained on
the grounds that this is only one example; that if the example had been
repeated a number of times, poorer agreement would have been obtained on
the average". It is not clear whether the data Weiss used was actual
error data or data generated through simulation. Thus, the true utility
of this model is not known, and although others have not actively pursued
this approach, it may require further investigation.
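
The constant-rate case lends itself to a short numerical check. The sketch below, with hypothetical function names, evaluates the discrete no-failure probability for a small time increment, the exponential limit for several independent failure sources, and the corresponding system MTBF as the reciprocal of the summed failure rates.

```python
import math


def no_failure_probability(t, rate, dt):
    """Probability of surviving time t when each increment dt fails
    with probability rate * dt (the discrete form before the limit)."""
    return (1.0 - rate * dt) ** (t / dt)


def system_reliability(t, failure_rates):
    """Exponential limit for several sources: exp(-sum(rates) * t)."""
    return math.exp(-sum(failure_rates) * t)


def system_mtbf(failure_rates):
    """System MTBF: reciprocal of the summed failure rates."""
    return 1.0 / sum(failure_rates)
```

Shrinking `dt` brings `no_failure_probability` arbitrarily close to the exponential form, which is the limiting argument made in the text.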


1.5.5  Weibull  Model  (Proposed  by  Wagoner) 

The  negative  exponential  distribution,  probably  the  most  widely  used 
probability  function  in  reliability  analysis,  is  briefly  discussed  by  Wagoner 
[134].  He  states  that  one  assumption  of  this  distribution  is  that  the 
failure rate is constant, which greatly restricts its applicability. Furthermore,
it is implied that the probability of failure is solely dependent on
the length of the time interval and independent of what hour of operation
the unit under consideration is in. This does not appear to be true, especially
in the initial phases of a program, when software errors occur quite frequently
and the operating times are relatively short. Thus, a distribution with a
decreasing failure rate would appear to be more appropriate.

The Weibull distribution, due to its allowance for an increasing,
decreasing, or constant failure rate, satisfies the aforementioned criteria.

Wagoner also cites Coutinho and Hudson as having applied the Weibull
distribution to model the software error detection process. However, these authors
used calendar time as the independent variable in their analyses. This
approach (supported by four applications to data) yielded values for the
shape parameter in excess of unity. Hence, this would imply that the failure
rate increases with time. Wagoner states that if this were the case, programs
would never achieve operational status since the number of software failures
per unit time would increase the longer the program was run. Thus, he
hypothesizes that choosing CPU time as the independent variable would be a better
choice. Supporting this statement is the fact that calendar time does not
account for the significant increase in running time per week which often
occurs as a program nears operational status. Secondly, calendar time will
not reflect the time periods when a program is not run at all. Thus, CPU
time should be a better measure than calendar time. The forthcoming
approach, proposed by Wagoner, was applied to a set of data and, in that
particular case, produced good results. However, Wagoner carefully notes that
although the Weibull distribution yielded accurate results in this one
application, similar results might not be obtained with other sets of data.

Below, equations 1 and 2 give the probability density function and
the cumulative distribution function respectively:

f_T(t) = (\eta / \sigma) (t / \sigma)^{\eta - 1} \exp[-(t / \sigma)^{\eta}]  for t \geq 0
       = 0  elsewhere    (Eq. 1)

F_T(t) = 1 - \exp[-(t / \sigma)^{\eta}]    (Eq. 2)

where

\sigma = a scale parameter    (Def. 1)
\eta = a shape parameter    (Def. 2)
t = the cumulative debugging time    (Def. 3)
F_T(t) = the cumulative fraction of errors detected at time t    (Def. 4)

Thus, it may be observed that for \eta > 1 the failure rate increases with time,
for \eta < 1 the failure rate decreases, and for \eta = 1 the failure rate is constant.
For the third case (\eta = 1) the Weibull distribution reduces to the negative
exponential distribution, which was previously discussed.

If it is assumed that the cumulative number of failures (errors)
discovered is distributed as a Weibull function, then the actual error data
may be compared to the model predicted values. Now, Wagoner suggests
implementing a transformation to linearize the cumulative distribution function
in order to simplify the parameter estimation process. Initially, the
cumulative distribution function may be restated as:

1 - F_T(t) = \exp[-(t / \sigma)^{\eta}]    (Eq. 3)

1 / [1 - F_T(t)] = \exp[(t / \sigma)^{\eta}]    (Eq. 4)

Then,

\ln \ln {1 / [1 - F_T(t)]} = \eta \ln t - \eta \ln \sigma    (Eq. 5)

or, the equation for a straight line

y = mx + b    (Eq. 6)

Defining the variables in equation 6 in terms of the Weibull distribution
gives:

y = \ln \ln {1 / [1 - F_T(t)]}    (Def. 5)

m = \eta    (Def. 6)

x = \ln t    (Def. 7)

b = -\eta \ln \sigma    (Def. 8)

Utilizing the method of least squares in conjunction with the software
error data gives estimates of the model parameters \eta and \sigma. Then a
fitted line may be derived and plotted on semi-log paper. Cumulative CPU
time represents the x-axis and the detected error function corresponds with
the y-axis.
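
The linearized least-squares fit described above can be sketched as follows. This is an illustrative implementation, not Wagoner's program: the function name is hypothetical, ordinary least squares is assumed, and the input fractions must lie strictly between 0 and 1, which presupposes that the total number of errors is known.

```python
import math


def fit_weibull(times, fractions):
    """Estimate (shape, scale) from cumulative fractions of errors found.

    times      -- cumulative debugging (CPU) times, all > 0
    fractions  -- cumulative fraction of errors detected by each time,
                  each strictly between 0 and 1
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(math.log(1.0 / (1.0 - f))) for f in fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # m, the shape parameter
    intercept = mean_y - slope * mean_x    # b = -shape * ln(scale)
    shape = slope
    scale = math.exp(-intercept / shape)
    return shape, scale
```

Because the fractions are computed against an assumed total error count, any misjudgment of that total changes every transformed point, which is one reason the known-total assumption deserves scrutiny.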

Although  Wagoner  states  that  the  Weibull  distribution  successfully 
models  the  particular  case  that  he  analyzes,  there  do  appear  to  be  some 
unanswered  questions.  First  of  all,  with  the  data  set  that  Wagoner  used, 
he assumed that the total number of errors in the program was known. We
strongly  question  this  assumption  and  are  concerned  with  the  applicability 
of  the  Weibull  distribution  in  the  event  that  the  total  number  of  errors  is 
not  known  or  cannot  be  accurately  estimated.  Secondly,  whereas  the  Weibull 
distribution  did  produce  favorable  results  in  Wagoner's  case,  the  overall 
utility  of  this  approach  cannot  be  assessed  until  additional  applications 
are  made. 

At  this  time,  we  find  some  assumptions  to  be  quite  questionable  and 
will  focus  our  attention  on  models  that  appear  to  be  more  promising. 

1.5.6 Nelson Model


This  model,  developed  at  TRW,  examines  the  actual  output  of  a program 
and  compares  it  with  the  desired  output  with  the  resulting  deviation  being 
recorded.  Based  on  this  type  of  approach  reliability  may  then  be  predicted. 

Initially,  the  following  definitions  are  presented. 

p = a computer program which gives a computable
    function F on the set E    (Def. 1)

E = {E_i : i = 1, 2, ..., N} (values of the input variables)    (Def. 2)

Thayer et al. in [128] then note that executing p produces F(E_i) for each
input E_i. Hence, due to imperfections in implementing p, p actually specifies
a function F' which differs from the desired function F. Therefore,

F(E_i) = the desired output    (Def. 3)

F'(E_i) = the actual output    (Def. 4)

Then, if

\Delta_i = an acceptable output tolerance    (Def. 5)

|F'(E_i) - F(E_i)| \leq \Delta_i  (acceptable deviation)    (Eq. 1)

|F'(E_i) - F(E_i)| > \Delta_i  (unacceptable deviation)    (Eq. 2)

It is further stated that for equation 2, the E_i comprise a subset E_e of E,
producing the unacceptable results. In addition to equation 2 yielding
unacceptable output, it is also possible that (1) execution may terminate
prematurely, or (2) execution may fail to terminate.


If

P = the probability that a run will result in an execution
    failure    (Def. 6)

n_e = the number of E_i in E_e    (Def. 7)

Then,

P = n_e / N    (Eq. 3)

Let R = the probability of no execution failures in a run of p    (Def. 8)

R = 1 - P = 1 - n_e / N    (Eq. 4)

It should be noted that equations 3 and 4 are based on the assumption
that each E_i is equally likely. However, the authors state that in
operational use, the inputs are most probably not equally likely. Instead,
they may be chosen in conjunction with some operational requirement.

P may then be expressed in terms of p_i (a probability distribution)
by defining an execution variable y_i:

y_i = 0 if the run with E_i produces acceptable results
    = 1 if the run with E_i yields an execution failure    (Def. 9)

Hence,

P = \sum_{i=1}^{N} p_i y_i    (Eq. 5)

and

R = \sum_{i=1}^{N} p_i (1 - y_i)    (Eq. 6)

If n runs are performed, we have:

R(n) = R^n = (1 - P)^n    (Eq. 7)


Furthermore, if the inputs are chosen in some definite sequence, then

p_{ji} = the probability that E_i is chosen as the input on the j-th run    (Def. 10)

So the probability that run j results in a failure is

P_j = \sum_{i=1}^{N} p_{ji} y_i    (Eq. 8)

Thus, the reliability for n runs is:

R(n) = (1 - P_1)(1 - P_2) ... (1 - P_n) = \prod_{j=1}^{n} (1 - P_j)    (Eq. 9)

or in exponential form:

R(n) = \exp[\sum_{j=1}^{n} \ln(1 - P_j)]    (Eq. 10)
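
The run-level quantities in this model are simple weighted sums, as the following sketch shows. The function names are hypothetical; the inputs are the probability that each input is selected on a run and a 0/1 indicator of whether that input causes an execution failure.

```python
import math


def run_failure_probability(input_probs, fails):
    """Probability that a single run fails: the sum over inputs of
    (selection probability) * (1 if that input fails, else 0)."""
    return sum(p * y for p, y in zip(input_probs, fails))


def reliability_n_runs(per_run_failure_probs):
    """Reliability over a sequence of runs, in the exponential form:
    exp of the summed log survival probabilities."""
    return math.exp(sum(math.log(1.0 - p) for p in per_run_failure_probs))
```

When every run uses the same input distribution, `reliability_n_runs` reduces to the single-run reliability raised to the number of runs.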


R(n)  may  be  expressed  in  terms  of  t,  the  execution  time,  by  utilizing  the 
following  approach. 



(Def.  11) 


In  summation,  we  have  found  no  major  flaws  with  this  model.  Although 
probability  of  failure  and  reliability  are  addressed,  predicted  times  between 
failure  or  number  of  errors  cannot  be  determined.  Thus,  for  our  particular 
problem,  this  model  does  not  appear  to  be  appropriate  in  that  many  aspects 
associated  with  it  are  concerned  with  problems  that  we  are  not  currently 
addressing. 


1.5.7 Wall and Ferguson Model

Wall and Ferguson [135] initiate their discussion of software reliability
by stressing the importance of identifying model parameters and using
available  software  error  data  to  estimate  the  current  reliability  as  well  as 
reliability  at  some  future  point  in  time.  Then  it  would  be  possible  to  identify 
particular  programs  which  are  not  performing  up  to  specified  standards. 

They  also  claim  that  the  analytical  models  that  have  been  proposed 
for  the  error  occurrence  rate  are  eloquent  yet  impractical.  These  models, 
in  their  opinion,  are  not  very  useful  to  the  system  designer  or  program 
manager  at  the  present  state-of-the-art  of  software  reliability  for  two 
reasons. First, "some of the relationships contain a large number of
parameters which must be empirically determined". Our search of the literature
has produced over twenty models, none of which contains a "large" number of
parameters. In fact, many of the models contain two parameters, which may
be estimated fairly easily by utilizing a computer program. They also cite
the  limited  amount  of  failure  data  available,  which  is  a valid  and  accurate 
statement.  Second,  they  claim,  these  analytic  relationships  do  not  seem 
to  fit  the  total  range  of  data  very  well. 

Wall and Ferguson suggest that there is an algebraic relationship
between the number of failures and the maturity of the software. Specifically,
they give the equation:

C = C_0 (M / M_0)^{\alpha}    (Eq. 1)

where C = cumulative number of software failures experienced
          by a "set" of software    (Def. 1)

C_0 = a constant    (Def. 2)

\alpha = a constant    (Def. 3)

M_0 = a scaling constant    (Def. 4)

M = the "maturity" of the software    (Def. 5)

The constants, \alpha and C_0, must be determined empirically according to the
authors, and they make a general statement that \alpha lies between 0.3 and 0.7
for a wide range of program types. The range of \alpha is extremely large and
does not appear to be as useful as it may have been intended to be. Nevertheless,
if equation 1 is differentiated with respect to time, we obtain an
expression for the failure rate.

R = \alpha C_0 [d(M / M_0) / dt] (M / M_0)^{\alpha - 1}    (Eq. 2)

Equation 2 may be rewritten as:

R = R_0 (M / M_0)^{\alpha - 1}    (Eq. 3)

where R_0 = a constant    (Def. 6)
The authors do present plots of data which seem to support their
contention that there is reasonable correlation between the failure data
and these functions. They do not, however, indicate what values they use
for the constants in each data set, nor do they indicate how they might be
estimated. In their paper, they do state, "a value, or even a range of values,
for C_0 and R_0 is considerably more difficult to obtain because of the wide
range of units for M, C, and R used to collect and report failure data". It
is conjectured that they used a regression technique to obtain the best
fitting curve in each case. As was stated, the exponent a appears to lie
between 0.3 and 0.7, but no such range could be determined for C_0 and R_0.
This problem of estimation casts doubt on the usefulness of the model. We
anticipate that a linear regression technique (using the logarithms of the
proposed functions) may be an adequate means of estimating these constants
for a particular model. However, due to some of the limitations and potential
problems we have chosen not to actively pursue this method at this time.
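
The log-linear fit anticipated above can be sketched as follows. Taking
logarithms of equation 1 gives log C = log C_0 + a log(M/M_0), so ordinary
least squares on the transformed data yields estimates of a and C_0. This is
only a minimal sketch and is not the procedure Wall and Ferguson used: the
function name fit_power_law, the choice M_0 = 1, and the noise-free synthetic
data are all assumptions made for illustration.

```python
import math


def fit_power_law(maturity, failures, m0=1.0):
    """Least-squares fit of log C = log C0 + a * log(M / M0).

    Returns the pair (C0, a). The scaling constant m0 is assumed known.
    """
    xs = [math.log(m / m0) for m in maturity]
    ys = [math.log(c) for c in failures]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx                       # slope of the log-log line
    c0 = math.exp(y_bar - a * x_bar)    # intercept, transformed back
    return c0, a


# Synthetic data generated from C0 = 12 and a = 0.5 (illustrative only,
# not taken from any of the data sets discussed in the text).
maturity = [1.0, 2.0, 4.0, 8.0, 16.0]
failures = [12.0 * m ** 0.5 for m in maturity]
c0_hat, a_hat = fit_power_law(maturity, failures)
print(c0_hat, a_hat)
```

With real failure data the points will scatter about the fitted line, and the
units chosen for M and C will change the value obtained for C_0, which is the
difficulty the authors themselves point out.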

1.5.8  Regression  Models 

Various  regression  models  have  been  proposed  which  attempt  to  identify 
those  software  characteristics  which  influence  software  reliability  [65,106]. 
These  program  characteristics  include  the  number  of  statements  in  the  program, 
the  number  of  branches,  the  number  of  I/O  instructions  and  some  measure  of 
complexity.  Characteristics  of  this  type  reflect  the  nature  of  the  program. 
Other  characteristics  which  reflect  the  ability  of  the  programmer  include  his 
years  of  academic  experience,  his  years  of  operational  experience,  and  a 
supervisor's rating. Of course, identifying such characteristics will not
enable us to predict a failure rate or reliability directly at any point in
time, but it does allow us to identify those principles and procedures which
lead to the efficient development of reliable software. Further, if we can
develop a viable prediction equation, we will be able to estimate the number
of errors in a program from the characteristics of the program and
the programmer's ability. We could then use this prediction to refine our
estimate of the number of errors that is derived using one of the analytic
models.

Lipow and Thayer [65] present a "non-standard method of linear
regression analysis" which evaluates predictors for the numbers of software
errors. There are some serious questions which must be resolved concerning
the validity of their method, which they themselves admit to be "very
non-standard". First, they constrain the regression coefficients of the
predictor (independent) variables to be non-negative. The rationale for the
non-negativity constraints is that no predictor variable should have a
negative impact on the number of software errors. Second, they assumed a zero
intercept for all regression equations. This second step is particularly
questionable, since a zero intercept would have little or no meaning for
their range of data and tends to yield higher, but non-comparable, correlation
coefficients than does the non-zero intercept model. In the text of their
paper and on the corresponding graphs, certain points were designated
statistical "outliers" and eliminated from their analysis. No valid reasoning
was presented as to why the "outliers" were excluded; one hypothesizes,
however, that "better results" could be obtained by ignoring such points.
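
The remark about higher but non-comparable correlation coefficients can be
illustrated numerically. The sketch below assumes, as is common practice, that
a zero-intercept fit is scored against the uncentered total sum of squares
while a fit with an intercept is scored against the centered one. The data and
function names are hypothetical and are not taken from Lipow and Thayer.

```python
def r2_with_intercept(xs, ys):
    """Coefficient of determination for y = b0 + b1 * x (centered total SS)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - sse / sst


def r2_through_origin(xs, ys):
    """Analogous quantity for y = b1 * x, using the uncentered total SS."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    sse = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    return 1.0 - sse / sum(y * y for y in ys)


# Hypothetical data: y is essentially flat, so x explains almost nothing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 4.9, 5.2, 4.8, 5.0]
print(r2_with_intercept(xs, ys), r2_through_origin(xs, ys))
```

For these points the zero-intercept value is far larger even though the
zero-intercept line fits the data worse, which is why the two figures should
not be compared directly.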

Schick and Wolverton [106] present a classical least squares regression
analysis where a best fit may be determined. Scatter plots and averaging
techniques were also used to determine possible relationships in the data.
Furthermore, a goodness of fit technique was utilized which is described below:

where \sigma^2_{observed} = variance of the actual values of the
                            dependent variable                 (Def. 3)
      n = total number of data points                          (Def. 4)
      T = the number of terms in the regression equation       (Def. 5)

If, in equation 1, r^2 approaches 1.0, then it may be said that the
estimating equation does account for the observed variations in the dependent
variable. Conversely, as r^2 approaches 0, the opposite would be true.

The  results  obtained  by  Schick  and  Wolverton  show  that  the  only 
significant  predictor  of  the  error  content  of  a program  and  the  cost  of 
developing  the  routine  is  the  size,  or  number  of  statements,  in  the 
routine.  The  effect  of  all  other  independent  variables  on  the  number  of 
errors  and  the  cost  of  the  routine  seems  to  be  negligible  although  a large 
portion  of  the  variation  in  the  data  remains  unexplained.  In  particular, 
programmer  experience  seems  to  have  little  effect  on  errors  or  costs.  They 
do  caution,  however,  that  only  the  most  intuitive  relationships  were  tested. 

The two regression techniques presented contrast sharply with one
another. Lipow and Thayer had some very questionable steps in their approach
and analysis. Schick and Wolverton utilized a much more common linear
regression analysis. However, it does not appear that the relationship between
program characteristics or programmer's ability and the reliability of the
software will be of any great value in predicting the number of errors in a
program.


1.6  ANALYTICAL  MODELS


Thayer, Craig, and Frey [128] define a "software reliability model"
as a mathematical model constructed for the purpose of assessing the
reliability of software from specified parameters which are either assumed
known or are estimated from observable data. We shall refer to models of
this type as "analytical models" since they describe mathematical
relationships between the parameters and reliability. That is, reliability (or
equivalently, failure rate) is depicted as a function of these parameters.

After reviewing other reliability techniques (Sections 1.2 and 1.3) and other
types of models (Sections 1.4 and 1.5) we feel that the analytical models offer
the most realistic and accurate way of determining and predicting software
reliability.

The  first  analytical  software  reliability  models  to  our  knowledge  were 
developed  in  1971,  and  we  have  found  a total  of  fifteen  developed  since  then. 

We  have  tried  to  carefully  study,  apply,  and  evaluate  all  of  them.  In  this 
section  we  present  all  fifteen  models  and  include  a summary  of  each  model,  a 
discussion  of  the  assumptions,  an  explanation  of  how  to  use  the  model,  and 
finally  a preliminary  evaluation  of  each  model  relative  to  the  intended  use 
of the software (i.e. in a digital flight control system). In this
preliminary evaluation eleven of the models are, for various reasons, deemed
inappropriate for use in our efforts. The remaining four models
(Jelinski-Moranda, extended Jelinski-Moranda, Schneidewind, and Geometric)
show some definite promise and will be studied and evaluated in greater
detail in Section 1.7.




1.6.1  Jelinski-Moranda  Model

One of the earlier software error models was proposed by Jelinski
and Moranda and may be shown to be equivalent to the Shooman exponential
model [56], [79]. This so-called "de-eutrophication" model makes the same
basic assumptions as does the Shooman model, and the two are nearly identical,
with the exception that Jelinski and Moranda do not explicitly include the
number of machine language instructions, although it is assumed that the
number remains relatively constant. The parameters associated with the
Jelinski-Moranda model, N and \phi, are considered fixed for the individual
program and are estimated via maximum likelihood (of course these estimates
may change as errors are discovered). Jelinski and Moranda define the
following notation for their model:

    N    = the total number of initial errors in the program   (Def. 1)
    \phi = a proportionality constant                          (Def. 2)
    X_i  = the length of the i-th debugging interval (the time
           between detection of the (i-1)-th and i-th errors)  (Def. 3)
    i    = the number of errors discovered (equals the
           number of intervals)                                (Def. 4)

Now, the hazard function during the i-th debugging interval (after the
detection of the (i-1)-th error, but before the detection of the i-th error) is

    Z(X_i) = \phi [N - (i-1)]                                  (Eq. 1)

Thus, in this manner, the hazard function Z(X_i) is defined piecewise over
time (see Figure 5).
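
The piecewise-constant hazard function can be tabulated directly. The sketch
below is illustrative only: the function name jm_hazard and the parameter
values are assumptions, not values from the report.

```python
def jm_hazard(phi, n_initial, i):
    """Hazard during the i-th debugging interval: phi * (N - (i - 1))."""
    if not 1 <= i <= n_initial:
        raise ValueError("i must satisfy 1 <= i <= N")
    return phi * (n_initial - (i - 1))


# Hypothetical parameters chosen only to show the staircase shape.
phi, n_initial = 0.02, 10
rates = [jm_hazard(phi, n_initial, i) for i in range(1, n_initial + 1)]
for i, rate in enumerate(rates, start=1):
    print(i, rate)
```

Each detected error lowers the hazard by the same amount, phi, and the hazard
during the last interval equals phi because one error remains.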

Forman [38] examines this particular model in detail and, assuming
that the model parameters are known, derives expressions for the number of
bugs to be removed before the software is deemed acceptable. The distribution
of the time to successfully debug the system is given, and bounds are
also given for this distribution. When the model parameters are not known,
Forman presents three methods for their estimation: maximum likelihood,
least squares and pseudo Bayesian.

Furthermore, Forman also attempts to expand the de-eutrophication
model to incorporate the rate of error generation during the debugging
process. He reports that good parameter estimates are difficult to attain
unless the software is completely, or nearly completely, debugged. As such,
he concludes, this model does not appear to be applicable during the entire
debugging process, but "can be used to answer the question of whether or not
all bugs have been removed when it is believed debugging has been completed".


[Figure 5: error rate versus time. Failure rate is proportional to the
number of errors remaining.]


The basic assumptions corresponding to the Jelinski-Moranda
de-eutrophication model are:

(1)  There  is  a fixed  number  of  errors  in  the  program. 

(2)  No  new  errors  are  added  during  the  debugging  process. 

(3)  The  failure  rate  is  proportional  to  the  current  error  content 
(number  of  remaining  errors). 

(4)  It  is  implicitly  assumed  that  each  error  has  an  equal  chance  of 
being  detected. 

(5)  Each  error  discovered  is  immediately  removed. 

(6)  The  failure  rate  between  errors  is  constant. 

Myers [86] states "It (the model) makes many assumptions and all of
them are questionable". However, we feel that most of the aforementioned
assumptions are fairly reasonable. The only assumptions which might be
questioned are, first, that no new errors are added during debugging and,
second, that each error is equally likely to be detected. Assumption 5 could
also be interpreted to mean that the program will not be used until a detected
error is corrected. Finally, this model also assumes that the program is not
being altered except for error correction.

Now it is possible to proceed with the estimation of the model
parameters, N and \phi (definitions 1 and 2). As previously mentioned, maximum
likelihood will be utilized with this model. In definition 3 we stated that
X_i was the length of the i-th debugging interval, and its density may be
given as

    P(X_i) = \phi [N-(i-1)] \exp\{-\phi [N-(i-1)] X_i\}        (Eq. 2)

Now the corresponding likelihood function may be written as

    L(X_1, X_2, ..., X_n)
        = \prod_{i=1}^{n} \phi [N-(i-1)] \exp\{-\phi [N-(i-1)] X_i\}   (Eq. 3)

where n = the total number of errors discovered to date
          (equals the number of intervals).                    (Def. 5)

Taking the logarithm of the likelihood function yields:

    \log L = \sum_{i=1}^{n} \log\{[N-(i-1)] \phi\}
             - \sum_{i=1}^{n} [N-(i-1)] \phi X_i               (Eq. 4)

    \log L = \sum_{i=1}^{n} \log\{N-(i-1)\} + n \log \phi
             - \sum_{i=1}^{n} [N-(i-1)] \phi X_i               (Eq. 5)

Taking partial derivatives with respect to N and \phi and then setting the
resultant equations equal to zero gives:

    \frac{\partial \log L}{\partial N}
        = \sum_{i=1}^{n} \frac{1}{N-(i-1)} - \sum_{i=1}^{n} \phi X_i = 0   (Eq. 6)

    \frac{\partial \log L}{\partial \phi}
        = \frac{n}{\phi} - \sum_{i=1}^{n} [N-(i-1)] X_i = 0    (Eq. 7)

Next, by letting \sum_{i=1}^{n} X_i = T we can solve equation 7 for \phi and
obtain

    \phi = \frac{n}{NT - \sum_{i=1}^{n} (i-1) X_i}             (Eq. 8)

where T = the cumulative debugging time (the sum of all
          interval lengths) = \sum_{i=1}^{n} X_i               (Def. 6)

Then, equation 6 becomes

    \sum_{i=1}^{n} \frac{1}{N-(i-1)}
        = \frac{n}{N - \frac{1}{T} \sum_{i=1}^{n} (i-1) X_i}   (Eq. 9)

Since \phi is no longer present in equation 9, solving this equation becomes
the next step. The two data derived statistics are T = \sum X_i and
\sum (i-1) X_i, and knowing these we can expand equation 9 and then solve
for N:

    \frac{1}{N} + \frac{1}{N-1} + ... + \frac{1}{N-(n-1)}
        = \frac{n}{N - \frac{1}{T} \sum_{i=1}^{n} (i-1) X_i}   (Eq. 10)

The estimate for N, called \hat{N}, can be utilized to obtain an estimate
for \phi:

    \hat{\phi} = \frac{n}{\hat{N} T - \sum_{i=1}^{n} (i-1) X_i}   (Eq. 11)

Rewriting the right hand side of equation 10 yields

    \frac{1}{N} + \frac{1}{N-1} + ... + \frac{1}{N-(n-1)}
        = \frac{n}{N-P}                                        (Eq. 12)

where P = \frac{1}{T} \sum_{i=1}^{n} (i-1) X_i                 (Def. 7)

Then multiplying both sides of equation 12 by (N-P) gives

    \frac{N-P}{N} + \frac{N-P}{N-1} + ... + \frac{N-P}{N-(n-1)} = n   (Eq. 13)

Splitting each term of equation 13 gives

    [\frac{N}{N} - \frac{P}{N}] + [\frac{N-1}{N-1} - \frac{P-1}{N-1}] + ...
        + [\frac{N-(n-1)}{N-(n-1)} - \frac{P-(n-1)}{N-(n-1)}] = n   (Eq. 14)

Simplifying equation 14 results in

    [1 - \frac{P}{N}] + [1 - \frac{P-1}{N-1}] + ...
        + [1 - \frac{P-(n-1)}{N-(n-1)}] = n                    (Eq. 15)

The above then reduces to

    \frac{P}{N} + \frac{P-1}{N-1} + ... + \frac{P-(n-1)}{N-(n-1)} = 0   (Eq. 16)

The parameters of the Jelinski-Moranda model can be estimated using equations
9-12. Essentially, equations 9, 10 and 12 are the same equation in different
notation and are used to find \hat{N}. Equation 11 provides an estimate
for \phi. The only data required for the de-eutrophication model is the
sequence of times between errors (i.e. X_1, X_2, X_3, ..., X_n).

This model has been selected for further study in Section 1.7.
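
The estimation procedure can be sketched numerically: solve equation 12 for N
by bisection, then substitute the result into equation 11. This is a minimal
illustration rather than the program used in this study. The function name
jm_mle, the bisection scheme, and the artificial data are assumptions. The
check that P exceeds (n-1)/2 follows from the behavior of equation 12 for
large N; when it fails, the intervals are not lengthening enough for a finite
estimate of N to exist.

```python
def jm_mle(intervals, iterations=200):
    """Return (N_hat, phi_hat) from the times between successive errors."""
    n = len(intervals)
    total = sum(intervals)                                   # T (Def. 6)
    weighted = sum(k * x for k, x in enumerate(intervals))   # sum of (i-1) X_i
    p = weighted / total                                     # P (Def. 7)
    if p <= (n - 1) / 2.0:
        raise ValueError("no finite estimate of N for these data")

    def g(big_n):
        # Left side minus right side of Eq. 12.
        return sum(1.0 / (big_n - k) for k in range(n)) - n / (big_n - p)

    lo = (n - 1) + 1e-9          # g is large and positive just above n - 1
    hi = 2.0 * n
    while g(hi) > 0:             # expand until the sign changes
        hi *= 2.0
    for _ in range(iterations):  # plain bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    n_hat = (lo + hi) / 2.0
    phi_hat = n / (n_hat * total - weighted)                 # Eq. 11
    return n_hat, phi_hat


# Artificial data: each interval is set to its expected length 1 / hazard
# for N = 10 and phi = 0.1, so the estimates should recover those values.
intervals = [1.0 / (0.1 * (10 - k)) for k in range(5)]
n_hat, phi_hat = jm_mle(intervals)
print(n_hat, phi_hat)
```

With real data the estimate of N need not be an integer and can change
considerably as each new interval is added.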

1.6.2  Basic  Shooman  Model

One of the earliest software reliability models was proposed by
Dr. Martin L. Shooman [116-118] of the Polytechnic Institute of New York.
In 1971 Shooman developed this model, which relates the probability of
encountering a software error to the number of errors remaining in the
software, the total number of machine language instructions and, in his
earlier papers, the instruction processing rate. One major objective was to
examine error behavior during debugging which possesses some generality for
both small and large programs, and then relate this behavior back to the
probability of encountering an error.

The  Shooman  model  may  be  applied  in  two  different  ways  and  a brief 
discussion  is  in  order.  First,  this  technique  may  be  utilized  through  a 
two point estimation process. This approach is susceptible to varied results
(predictions) depending on the pair of estimation points chosen. Further
development  of  this  particular  method,  including  its  application  to  software 
error  data,  will  appear  later  in  this  section.  The  second  alternative  with 
Shooman's  model  is  a full  application  of  the  model,  and  this  approach  will 
be  examined  in  considerable  detail  following  a presentation  of  some  of  the 
basic  model  concepts. 

Initially,  we  will  establish  the  following  assumptions  that  Shooman 
specifies  regarding  his  exponential  model. 

(1)  There  is  a fixed  number  of  errors  in  a program. 

(2)  No  new  errors  are  added  during  debugging. 

(3)  The  error  detection  rate  (failure  rate)  is  proportional  to  the 
number  of  remaining  errors. 

(4)  The number of machine language instructions is constant.

(5)  Operational  software  errors  occur  due  to  the  occasional 
traversing  of  a portion  of  the  program  in  which  a hidden 
software  bug  is  lurking. 

(6)  Implicitly  assumes  that  each  error  has  an  equal  chance  of 
being  detected. 

(7)  Implicitly  assumes  that  a relationship  between  operational  time 
and  debugging  time  can  be  determined. 


It should be noted that Shooman based his model on some simplifying
assumptions. First, let us evaluate the impact of the assumptions that there
is a fixed number of errors in the program and that no new errors are generated
during the debugging process. Stating that no new errors are introduced
during debugging implies that the correction procedure is perfect, which is
not considered to be a very realistic assumption. Together, the two
aforementioned assumptions imply that eventually we will reach the point where
software is perfectly reliable, a conclusion most practitioners will be
unwilling to make no matter how good the program appears (for a program of
any reasonable size).

It  has  been  further  assumed  that  the  error  detection  rate  is  proportional 
to  the  number  of  errors  remaining  in  the  program.  This  suggests  that  the 
failure  rate  is  constant  between  the  detection  of  consecutive  errors  and 
declines  as  an  error  is  detected  (see  Figure  5).  Implicit  in  this  are  the 
assumptions  that  software  errors  are  independent  of  each  other  and  that 
each  error  has  an  equal  chance  of  being  detected  at  a given  point  in  time. 

Hence,  these  assumptions  result  in  an  exponential  distribution  for  the  times 
between  error  occurrences.  Thus,  errors  occur  according  to  a Poisson  process 
whose  parameter  declines  whenever  an  error  is  detected. 
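
Under these assumptions the times between errors can be simulated by drawing
each interval from an exponential distribution whose rate falls after every
detection. The sketch below is illustrative only: the function name
simulate_error_times, the parameter values, and the seed are assumptions.

```python
import random


def simulate_error_times(rate_per_error, n_errors, seed=0):
    """Draw one sequence of times between errors.

    The k-th draw (k = 0, 1, ...) uses rate rate_per_error * (n_errors - k),
    i.e. a rate proportional to the number of errors still remaining.
    """
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_error * (n_errors - k))
            for k in range(n_errors)]


# Hypothetical values: 8 initial errors, each contributing 0.05 failures
# per unit time to the detection rate.
sample = simulate_error_times(0.05, 8, seed=1)
for k, x in enumerate(sample, start=1):
    print(k, round(x, 2))
```

The intervals tend to lengthen as errors are removed, but any single
realization is noisy, which is one reason parameter estimates from short
sequences are unstable.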

Although  these  assumptions  may  be  questioned  they  do  appear  to  be 
reasonable  for  approximate  error  behavior  in  Phases  II,  III  and  IV  (see 
Figure  1).  These  phases  were  discussed  in  Section  1.1.3  (The  Software 
Development  Process).  In  fact,  behavior  in  the  latter  two  stages  may  fit 
these  assumptions  quite  well.  Nevertheless,  these  particular  assumptions 
may  be  required  in  order  to  develop  a workable  model.  Intuitively,  however, 
we  can  appreciate  the  possibility  that  an  error  may  be  "hidden"  by  another 
error  which  has  not  been  detected.  Furthermore,  it  seems  plausible  that 
some errors will be more subtle than others; hence some errors would be less
likely to be detected. Overall, however, the equally likely assumption
seems to have merit.

Dr. Shooman also assumed that the number of instructions in the
program remains unchanged, which reflects the reasonable assumption that
we are dealing with a relatively "mature" piece of software (one that is
basically unmodified except for error corrections). Finally, he also
assumes that the relationship between operational time and debugging time or
effort is known.

Similar to many of the analytical models, the Shooman model can be
characterized by an instantaneous error detection rate or hazard function
Z(t). As previously mentioned, this model assumes that the hazard function
is proportional to the number of errors remaining in the program.
Specifically, the Shooman hazard function is given by:

    Z(t) = C [\epsilon_r(\tau)]                                (Eq. 1)

    Z(t) = C \{[E_T/I_T] - \epsilon_c(\tau)\}                  (Eq. 2)

    Z(t) = C \{[E_T/I_T] - [E_c(\tau)/I_T]\}                   (Eq. 3)

    Z(t) = [C/I_T] [E_T - E_c(\tau)]                           (Eq. 4)

where t = operating time of the system measured from its
          initial activation                                   (Def. 1)
      \tau = the amount of debugging time since the start of
          system integration                                   (Def. 2)
      C = a proportionality constant                           (Def. 3)
      \epsilon_r(\tau) = normalized number of errors remaining at
          time \tau = E_r(\tau)/I_T = E_T/I_T - \epsilon_c(\tau)   (Def. 4)
      E_T = the total number of errors at time \tau = 0        (Def. 5)
      I_T = the total number of machine language instructions  (Def. 6)
      E_c(\tau) = the cumulative number of errors corrected
          (detected) at time \tau                              (Def. 7)
      \epsilon_c(\tau) = normalized number of errors corrected at
          time \tau = E_c(\tau)/I_T                            (Def. 8)
      E_T - E_c(\tau) = the number of errors remaining at
          time \tau                                            (Def. 9)




Prior to discussing the method for estimating C and E_T it should
prove helpful to discuss two final assumptions. First, it is tacitly
assumed that each error has the same severity (i.e. all errors are
catastrophic). This assumption could be easily removed by categorizing the
errors by severity and thus estimating a new E_T and C for each of the
categories. Secondly, it was assumed that the software errors were
corrected immediately since the failure rate changes whenever an error is
detected. This could be handled by either not using the program until the
error is corrected or simply not counting it again (which is a questionable
procedure practically, since it does assume that the error will be corrected
and gives a distorted view of the true error rate).

Let us now examine the testing program for estimating the two
unknowns, C and E_T. Initially, the software failure rate is defined as:

    \lambda_{s_i} = X_{s_i} / H_i                              (Eq. 5)

where X_{s_i} = software failures found during H_i             (Def. 10)
      H_i = "locally" cumulative number of units of time
            after debugging time \tau_i                        (Def. 11)

Then,

    MTTF_{s_i} = H_i / X_{s_i}                                 (Eq. 6)

In [117] Shooman gives the following reliability and mean time to failure
equations:

    R(t, \tau) = \exp\{-C [E_T/I_T - \epsilon_c(\tau)] t\}     (Eq. 7)

    MTTF(\tau) = \frac{1}{C [E_T/I_T - \epsilon_c(\tau)]}      (Eq. 8)

where \tau = the debugging time                                (Def. 12)
      t = operational time from initial activation             (Def. 13)

The unknowns (C and E_T) can be evaluated by running a functional test
after two different debugging times, \tau_1 and \tau_2, chosen so that
\tau_1 < \tau_2 and \epsilon_c(\tau_1) < \epsilon_c(\tau_2). Then, by equating
the mean time to failure equations (equations 6 and 8) at times \tau_1 and
\tau_2, we obtain equations 9 and 10.

    H_1 / X_{s_1} = \frac{1}{C [E_T/I_T - \epsilon_c(\tau_1)]}   (Eq. 9)

    H_2 / X_{s_2} = \frac{1}{C [E_T/I_T - \epsilon_c(\tau_2)]}   (Eq. 10)

Now it is possible to estimate E_T by taking the ratio of equations 9 and
10 and then applying equation 5. The resulting estimate is:

    E_T = \frac{I_T [(\lambda_{s_2}/\lambda_{s_1}) \epsilon_c(\tau_1)
                     - \epsilon_c(\tau_2)]}
               {(\lambda_{s_2}/\lambda_{s_1}) - 1}             (Eq. 11)

Next, the proportionality constant can be estimated by:

    C = \frac{\lambda_{s_1}}{E_T/I_T - \epsilon_c(\tau_1)}     (Eq. 12)


Finally, the times between failures constitute the data required for the
Shooman model. Let us now examine the two point estimation process when
applied to actual software data. A few sample calculations will be given
to demonstrate the mechanics of this particular approach. The data in
Figure 6 was obtained from reference [134].

                  Cumulative                Cumulative   Errors Per Unit
Date    Errors    Number of     CPU time    CPU time     of CPU time
                  Errors

1/12       8          8           0.50         0.50         16.00
1/15       7         15           0.60         1.10         11.67
1/16       1         16           0.65         1.75          1.54
1/17       8         24           1.90         3.65          4.21
1/18      16         40           1.59         5.24         10.06
1/19      18         58           8.83        14.07          2.04
1/22      13         71           9.94        24.01          1.31
1/23       8         79           7.25        31.26          1.10
1/24       9         88           8.34        39.60          1.08
1/25       2         90           3.86        43.46          0.52
1/26       6         96          13.11        56.57          0.46
1/27       3         99          34.15        90.72          0.09
1/29       3        102          82.70       173.42          0.036
1/30       2        104           1.10       174.52          1.82
1/31       3        107          51.59       226.11          0.06

Fig. 6  F-11D Program Error Data


As can be seen in column 6 of Figure 6, the error behavior is somewhat
erratic through day 1/18. Thus, day 1/18 will be analyzed along with day 1/19
(chosen arbitrarily). From equation 5 we can calculate the following:

    \lambda_{s_1} = 18/8.83 = 2.038505

    \lambda_{s_2} = 13/9.94 = 1.307847

so

    \lambda_{s_2} / \lambda_{s_1} = 0.641572

For simplicity, let us set I_T = 1. Moranda, in [80], states "it is clear
that the number of instructions I_T is only a nuisance since it is taken out
by the 'normalization' of \epsilon_c(\tau_1) and \epsilon_c(\tau_2), which is
required to produce numerical values". Then, by definition 8, we have

    \epsilon_c(\tau_1) = 40

    \epsilon_c(\tau_2) = 58

Hence, from equation 11, we can estimate E_T.

    E_T = [25.66286568 - 58] / [0.641572 - 1] = 90.219

Thus, the estimated total number of errors at time \tau = 0 is approximately
90. This compares with 107 errors found as reported in Figure 6. Now, C, the
constant of proportionality, is estimated by equation 12 and for these two
points (1/18 and 1/19) the calculated value is:

    C = 2.0385050 / [90.219 - 40] = 0.0406

Now it is a relatively simple algebraic exercise to get estimates of the
MTTF's for each of our data points. For example, using the data up through
day 1/22 in conjunction with equation 8 gives the MTTF for 1/23.

    MTTF = 1 / [0.0406 (90.219 - 71)] = 1.282

This estimated value (1.282) may be compared with the actual MTTF of 0.90625.
The actual value was obtained by taking the inverse of column 2 divided by
column 4 (Figure 6) for day 1/23. This same procedure may be followed for
the remaining points (days). As can be seen in Figures 7 and 8, varied results
were obtained when different combinations of points were selected. Some of the
curves were better predictors than others; however, even with this small set
of data points (15 points) there are 105 (15 choose 2) possible pairs of
points to choose from. Nevertheless, we do not know if any of the curves is in
fact the best predictor. Figure 8 shows that days 1/18 and 1/22 produce an
E_T of 107.58, which is the best estimate for the pairs of points we analyzed.
However, there are about 100 additional pairs for which we did not calculate
E_T. One possible approach to the above problem (not being able to consider
all combinations of points) is to take a reasonable number of pairs and
calculate a mean value from the individual estimates. Utilizing Figure 8,
with an n of 6, we find E_T = 104.5.

The estimated MTTF's (Figure 8) show considerable variation with one problem
quite evident. If the estimated E_T is less than the actual number of errors
(107 in this example) negative MTTF's may occur as shown in columns 3 and 7.
Of course, such values are meaningless in a reliability analysis.
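
The two point calculation for days 1/18 and 1/19 can be reproduced with a few
lines of code. This is a sketch of equations 8, 11 and 12 with I_T = 1, using
the Figure 6 values quoted in the text. The function names are assumptions
introduced for illustration.

```python
def two_point_estimate(lam1, lam2, eps1, eps2, i_t=1.0):
    """Estimate (E_T, C) from two failure rates and two corrected-error counts."""
    ratio = lam2 / lam1
    e_t = i_t * (ratio * eps1 - eps2) / (ratio - 1.0)   # Eq. 11
    c = lam1 / (e_t / i_t - eps1)                       # Eq. 12
    return e_t, c


def mttf(c, e_t, eps_c, i_t=1.0):
    """Mean time to failure after eps_c (normalized) errors are corrected."""
    return 1.0 / (c * (e_t / i_t - eps_c))              # Eq. 8


lam1 = 18 / 8.83     # failure rate observed on day 1/19
lam2 = 13 / 9.94     # failure rate observed on day 1/22
e_t, c = two_point_estimate(lam1, lam2, eps1=40, eps2=58)

print(round(e_t, 3), round(c, 4))
print(round(mttf(c, e_t, 71), 3))    # prediction for day 1/23
print(mttf(c, e_t, 107))             # negative: E_T is below the 107 errors found
```

The last line shows the difficulty noted above: once the cumulative number of
corrected errors exceeds the estimated E_T, equation 8 returns a negative
MTTF.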

At this point we shall look at the full Shooman model and show that it
reduces to the previously discussed Jelinski-Moranda model. Jelinski and
Moranda proposed a hazard function of the form

    Z(X_i) = \phi [N - (i-1)]                                  (Eq. 13)

where X_i = the length of the i-th debugging interval (the
            time between the (i-1)-th and i-th errors)         (Def. 14)
      \phi = a proportionality constant                        (Def. 15)
      N = the total number of initial errors                   (Def. 16)
      i = the number of errors discovered after i intervals    (Def. 17)

Recalling the Shooman hazard function as being

    Z(t) = C (\epsilon_r(\tau))                                (Eq. 14)

where

    \epsilon_r(\tau) = E_T/I_T - \epsilon_c(\tau)              (Eq. 15)

we can show that the two respective hazard functions (equations 13 and 14) are
identical if

    \phi = C/I_T                                               (Def. 18)

    N = E_T                                                    (Def. 19)

    \epsilon_c(\tau) = (i-1)/I_T                               (Def. 20)

In [117] Shooman states that definitions 18 and 19 are equivalent, with the
authors merely selecting different notation. Definition 20 would be true if
E_c(\tau) = i-1. Utilizing previous definitions (Def. 1-7) associated with the
Shooman model and making the required substitution will show that the Shooman
exponential model does reduce to the Jelinski-Moranda model. Below is the step
by step proof.

    Z(t) = C (\epsilon_r(\tau))                                (Eq. 16)

Substituting equation 15 in for \epsilon_r(\tau) yields

    Z(t) = C [E_T/I_T - \epsilon_c(\tau)]                      (Eq. 17)

From definition 20 we have

    Z(t) = C [E_T/I_T - (i-1)/I_T]                             (Eq. 18)

Bringing the I_T term outside the parenthesis results in

    Z(t) = (C/I_T) [E_T - (i-1)]                               (Eq. 19)

Now we have previously defined \phi = C/I_T and N = E_T so that

    Z(t) = \phi [N - (i-1)]                                    (Eq. 20)

Thus, equation 20 shows that the two models (Shooman and Jelinski-Moranda) do
have identical hazard functions. Consequently, applying either model
should produce identical results.

The Shooman model has been adapted by Dickson et al. [30] with a moderate
degree of success. It was also the basis of a study and extension by Miyamoto
[77] (to be discussed later) which attempts to include a non-zero probability
of generating new errors.

1.6.3  Extended  Jelinski-Moranda  Model 

As we have shown, both the Shooman exponential model and the
Jelinski-Moranda model require a sequence of times between failures in order
to estimate their parameters. It may be the case, however, that such data is
not available, and we may only have data on the number of errors which
occurred in a time interval. If this is the case, then the models can easily
be extended and adapted to this situation.

The original models assume a constant failure rate between consecutive
errors. One extension (essentially due to Lipow) approximates that behavior
by assuming that the failure rate is constant over a time interval. Within
the time interval, failures occur according to a Poisson distribution with
the constant failure rate as its parameter. Between time intervals, the
failure rate declines proportionally to the cumulative number of errors
detected in the previous intervals (Figure 9). The basic assumptions associated
with the extended Jelinski-Moranda model are the same as those of the basic
Jelinski-Moranda model. Thus, the reader may refer back to the section covering
the basic model for the necessary assumptions. With the extended model it is
also assumed that more than one error may occur in a given debugging time
interval.

Stated symbolically, if we let $n_i$ be the cumulative number of errors found
through the $i$-th time interval, then the hazard function during the $i$-th time
interval is:

    $Z(t_i) = \phi\,[N - n_{i-1}]$                                     (Eq. 1)

where $\phi$ = a proportionality constant                              (Def. 1)
      $N$ = the total number of initial errors                         (Def. 2)
      $n_i$ = the cumulative number of errors found through
              the $i$-th time interval                                 (Def. 3)
      $t_i$ = the $i$-th debugging interval                            (Def. 4)


Below is a description of the approach to be used in estimating the model
parameters. We begin with the density function and the corresponding likelihood
function. After taking logarithms and the partial derivatives with respect to
$\phi$ and $N$, we are left with equations 2 and 3.

[Figure: error rate Z(T) versus time. The failure rate during an interval is
proportional to the number of remaining errors at the beginning of the interval.]


    $\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} - \sum_{i=1}^{m} \phi\, t_i = 0$      (Eq. 2)

where $m$ = the total number of time intervals                         (Def. 5)
      $m_i$ = the number of errors found in the $i$-th time interval   (Def. 6)

$N$, $n_i$, $\phi$ and $t_i$ were previously defined.

We also have

    $\frac{n}{\phi} - \sum_{i=1}^{m} (N - n_{i-1})\, t_i = 0$           (Eq. 3)

where $n$ = the number of errors found to date                         (Def. 7)

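The test data required consists of the number of errors detected per time
interval (i.e. $m_1, m_2, \ldots, m_m$).

Solving equation 3 for $\phi$ and substituting the result into equation 2 leaves
a single equation in $N$:

    $\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} = \frac{n}{N - \frac{1}{A} \sum_{i=1}^{m} n_{i-1}\, t_i}$   (Eq. 4)

where $A = \sum_{i=1}^{m} t_i$                                          (Def. 8)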

This  model  has  been  selected  for  further  study  in  section  1.7. 

Note:  In  [124],  Sukert  had  equation  4 as  division  rather  than  multiplication. 
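
The estimation procedure for the extended Jelinski-Moranda model can be illustrated
numerically. The Python sketch below is our own illustration, not part of the
original study. It assumes that equation 4 reads
sum(m_i / (N - n_{i-1})) = n / (N - (1/A) sum(n_{i-1} t_i)) and that equation 3 then
gives phi = n / sum((N - n_{i-1}) t_i). The function name fit_extended_jm, the
bracketing strategy and the use of bisection are assumptions made for the example.

```python
def fit_extended_jm(counts, lengths, iterations=200):
    """Estimate (N, phi) from per-interval error counts m_i and interval lengths t_i.

    Returns None when no finite root of Eq. 4 can be bracketed, for example
    when the counts show no downward trend.
    """
    # n_prev[i] holds n_{i-1}: the errors found before interval i begins.
    n_prev = [0.0]
    for c in counts[:-1]:
        n_prev.append(n_prev[-1] + c)
    n = sum(counts)                       # errors found to date (Def. 7)
    A = sum(lengths)                      # total test time (Def. 8)
    B = sum(p * t for p, t in zip(n_prev, lengths)) / A

    def g(N):
        # Left side of Eq. 4 minus its right side.
        return sum(c / (N - p) for c, p in zip(counts, n_prev)) - n / (N - B)

    lo = n_prev[-1] * (1.0 + 1e-9) + 1e-9   # N must exceed n_{m-1}
    if g(lo) <= 0:
        return None
    hi = 2.0 * lo + 1.0
    for _ in range(60):                   # widen the bracket until g changes sign
        if g(hi) < 0:
            break
        lo, hi = hi, 2.0 * hi
    else:
        return None
    for _ in range(iterations):           # plain bisection on N
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    N = 0.5 * (lo + hi)
    phi = n / sum((N - p) * t for p, t in zip(n_prev, lengths))   # from Eq. 3
    return N, phi
```

If the counts are generated without noise from the model itself (each m_i set equal
to phi (N - n_{i-1}) t_i), both likelihood equations are satisfied exactly at the
generating values, so the sketch should return those values. With real data a
finite root need not exist, which is why the function may return None.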

1.6.4  Schneidewind  Model 

Norman F. Schneidewind has presented a model of error detection and
correction which is intended to serve as a decision aid for controlling the
quality of software [109]. This approach uses a non-homogeneous Poisson
process to model the occurrence of errors detected during the testing phases of
software. The parameters ($\alpha$ and $\beta$) are estimated via maximum likelihood
and weighted least squares. Then, based on these estimates, forecasts of the
cumulative number of detected errors can be made. The inputs are error detection
histories while the outputs are forecasts of the future behavior of the error
correction and detection processes. A major objective of this particular model
is to forecast the mean number of cumulative errors for some future time T.

Schneidewind, in [109], makes the following point: "Unlike hardware
which wears out or deteriorates with time, software should improve with time as
more of the latent errors are detected and corrected. However, there are
exceptions to this general characteristic. When an error is removed, it is possible
that one or more errors will be introduced".

Then,  in  [138]  Schick  and  Wolverton  further  state  that  "errors  may  reside 
undetected  in  software  for  many  years  until  a particular  set  of  input  data  causes 
a previously  untraversed  module  path  to  be  executed". 

Following  a listing  of  the  assumptions,  the  various  methods  of  this  model 
will  be  analyzed. 

(1)  The number of errors which is detected during a time interval and
the  collection  of  error  counts  over  a series  of  time  intervals  are 
modelled  by  a random  variable  and  a stochastic  process. 

(2)  Prior  to  the  selection  of  a test  plan,  all  errors  are  equally  likely. 

(3)  The  number  of  errors  detected  in  each  time  interval  is  independent  of 
the  number  detected  in  another  time  interval. 

(4)  Detected  error  counts  in  each  interval  have  the  same  form  (type  of 
distribution)  but  have  different  means. 

(5)  The mean number of detected errors decreases from interval to
interval.

(6)  The rate of detection in an interval is proportional to the number
of errors in that interval.

(7)  Specifically,  the  detected  error  process  is  assumed  to  be  a non- 
homogeneous  Poisson  process  with  an  exponentially  decreasing 
intensity  function  (failure  rate). 

(8)  The  error  correction  rate  is  proportional  to  the  number  of  errors 
to  be  corrected. 


The assumptions of this model seem to be straightforward, and we feel that
they are also fairly realistic. Also, as with many of the other analytical models,
the required test data consists of the sequence of the number of errors in each
interval.

Since the error process appears to change over time, recent error data would
seem to be more useful than earlier data. Thus, we must carefully consider the
following questions:

(1)  to what extent should historical data be considered in forecasting, and

(2)  how much of the historical time record should be included when estimating
$\alpha$ and $\beta$.

Based on these points, Schneidewind presents three approaches for this
model which utilize the available data in different ways. First, with method 1,
all error counts in intervals 1 through t are used,

where t = the last interval for which data is available (Def. 1)

Schneidewind further states that this particular method is appropriate if the
changes in the error counts from intervals 1 to t are representative of the
future ability to detect errors. In such a case, maximum likelihood can be
applied to all error counts ($X_k$) for intervals 1 through t. Below is the
approach to be used for method 1.

First we define,

      $\beta$ = a model constant                                        (Def. 2)
      $\alpha$ = a model constant (error detection rate at time 0)      (Def. 3)

Now, to determine $\beta$, the following polynomial must be solved for $y$.

    $A y^{t+1} - (A+1)\, y^{t} + (t-A)\, y + (A+1-t) = 0, \quad y > 1$    (Eq. 1)

where $A = \frac{\sum_{k=0}^{t-1} k\, X_{k+1}}{\sum_{k=1}^{t} X_k}$     (Def. 4)
      $X_k$ = the number of errors found in interval $k$               (Def. 5)
      $t$ = previously defined (Def. 1)
      $y = e^{\beta}$                                                   (Def. 6)

Thus, $\beta = \ln y$                                                   (Def. 7)

Now, $\alpha$ is determined by

    $\alpha = \frac{\beta \sum_{k=1}^{t} X_k}{1 - e^{-\beta t}}$         (Eq. 2)

Upon obtaining estimates of $\alpha$ and $\beta$ we can find the predicted number of
errors for each interval $i$ by

    $m_i = \frac{\alpha}{\beta} \left[ e^{-\beta (i-1)} - e^{-\beta i} \right]$   (Eq. 3)

where $m_i$ = the estimated number of errors in interval $i$           (Def. 8)
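
The method 1 calculation can be sketched in a few lines of Python. This is an
illustration under stated assumptions rather than Schneidewind's own procedure.
Instead of finding polynomial roots directly, the sketch divides the polynomial
A y^(t+1) - (A+1) y^t + (t-A) y + (A+1-t) = 0 by (y - 1)(y^t - 1), which gives the
equivalent condition 1/(y-1) - t/(y^t - 1) = A, and solves that for beta = ln y by
bisection. It further assumes alpha = beta * sum(X_k) / (1 - exp(-beta t)), which
follows from equating the observed and expected total counts. The function names
are hypothetical.

```python
import math


def fit_schneidewind_method1(x, iterations=200):
    """Return (alpha, beta) for error counts x, where x[k] is X_{k+1}; None if no fit."""
    t = len(x)
    total = sum(x)
    A = sum(k * xk for k, xk in enumerate(x)) / total     # Def. 4

    def h(beta):
        # Polynomial condition divided by (y - 1)(y**t - 1), with y = exp(beta).
        return 1.0 / math.expm1(beta) - t / math.expm1(beta * t) - A

    lo, hi = 1e-6, 1.0
    if h(lo) <= 0:              # counts do not decrease: no root with y > 1
        return None
    while h(hi) > 0:            # widen the bracket until h changes sign
        hi *= 2.0
        if hi * t > 700:        # exp() would overflow beyond this point
            return None
    for _ in range(iterations): # plain bisection on beta
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = beta * total / (1.0 - math.exp(-beta * t))
    return alpha, beta


def predicted_counts(alpha, beta, t):
    """Predicted number of errors m_i for intervals i = 1..t."""
    return [(alpha / beta) * (math.exp(-beta * (i - 1)) - math.exp(-beta * i))
            for i in range(1, t + 1)]
```

Bisection is used only because it is simple and robust for a single root; any
polynomial root finder restricted to y > 1 would serve equally well.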


Next, for method 2, a different approach is taken. Here, none of the error
counts in intervals 1 through s-1 are used ($2 \le s \le t$) and all intervals from
s to t are used.

      s = an index with unit increment                                 (Def. 9)

This type of approach is appropriate if the most recent observations (intervals
s through t) appear to be more representative of the future ability to
detect errors. With this method, some criterion for selecting s is necessary.

In such a case, $\alpha$ and $\beta$ must be determined, again by maximum likelihood,
for all values of s from 2 through t. Once again we begin by solving a polynomial
for y.


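    $A y^{C+1} - (A+1)\, y^{C} + (C-A)\, y + (A+1-C) = 0, \quad y > 1$    (Eq. 4)

where $A = \frac{\sum_{k=0}^{t-s} k\, X_{s+k}}{\sum_{k=s}^{t} X_k}$     (Def. 10)
      $C = t-s+1$                                                       (Def. 11)
      $\beta = \ln y$                                                   (Def. 12)

Then,

    $\alpha = \frac{\beta \sum_{k=s}^{t} X_k}{1 - e^{-\beta C}}$         (Eq. 5)

In effect, the results will give $\alpha_s$ and $\beta_s$ for all $s$ such that
$2 \le s \le t$.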

Now, the best results are desired, and Schneidewind discusses a technique for
determining the optimal values. He suggests computing the sum of weighted squared
deviations, $SD_W$, between $m_i$ and $X_i$ from intervals 1 to t (intervals are of
equal length).

      $m_i$ = see definition 8
      $X_i$ = the actual number of errors in interval $i$              (Def. 13)

The approach of using $SD_W$, which is also used in conjunction with method 3,
will be discussed in more detail following a presentation of the third method.

With method 3, the cumulative error count from intervals 1 through s-1 is
used and the individual error counts in intervals from s to t are also used. This
particular method is appropriate if the individual error counts in intervals 1
to s-1 are not representative of the ability to predict the future but the
cumulative count is, and the individual error counts from s to t are also
representative.

Method 3, somewhat similar to method 2, requires the estimation of $\alpha$
and $\beta$ for all values of s from 2 through t. To do so, the following polynomial
is solved for y.

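    $A y^{s+t} - (A + X_{s,t})\, y^{s+t-1} - (A + sX_{s-1} - X_{s-1})\, y^{t+1}$
    $\quad + (A + X_{s,t} + sX_{s-1} - X_{s-1})\, y^{t} - (A - tX_t)\, y^{s}$
    $\quad + (A + X_{s,t} - tX_t)\, y^{s-1} + (A + sX_{s-1} - tX_t - X_{s-1})\, y$
    $\quad - (A + sX_{s-1} + X_{s,t} - tX_t - X_{s-1}) = 0$              (Eq. 6)

where $A = \sum_{k=0}^{t-s} (s+k-1)\, X_{s+k}$                          (Def. 14)
      $X_{s,t} = \sum_{k=s}^{t} X_k$                                    (Def. 15)
      $X_{s-1} = \sum_{k=1}^{s-1} X_k$                                  (Def. 16)
      $X_t = \sum_{k=1}^{t} X_k$                                        (Def. 17)
      $\beta = \ln y$                                                   (Def. 18)

Then,

    $\alpha = \frac{\beta \sum_{k=1}^{t} X_k}{1 - e^{-\beta t}}$         (Eq. 7)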


Let us now consider in more detail the weighted squared deviations criterion for
selecting the optimal results associated with methods 2 and 3. The following
formula can be utilized with both methods and is given as

    $SD_W = \sum_{i=1}^{t} e^{\beta i} \left[ \frac{\alpha}{\beta}\, e^{-\beta i} \left( e^{\beta} - 1 \right) - X_i \right]^2$   (Eq. 8)

That is, you substitute in $\alpha_2$ and $\beta_2$ to get $SD_{W_2}$, $\alpha_3$ and
$\beta_3$ to get $SD_{W_3}$, and so on through $\alpha_t$ and $\beta_t$ to get
$SD_{W_t}$. Then, you select the $SD_W$ that is a minimum and record the
associated positive values of $\alpha$ and $\beta$.
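
The selection rule just described can be sketched as follows. This is an
illustrative sketch only. It assumes that the criterion is the sum over intervals
1 to t of exp(beta i) times the squared deviation between the predicted count
(alpha/beta) exp(-beta i) (exp(beta) - 1) and the observed count X_i, which is our
reading of "sum of weighted squared deviations". The function names and the
candidate list are hypothetical.

```python
import math


def weighted_squared_deviation(alpha, beta, x):
    """SD_W for counts x, where x[i-1] is the observed count X_i."""
    total = 0.0
    for i, xi in enumerate(x, start=1):
        # Predicted count for interval i, written as in the SD_W formula.
        m_i = (alpha / beta) * math.exp(-beta * i) * (math.exp(beta) - 1.0)
        total += math.exp(beta * i) * (m_i - xi) ** 2
    return total


def select_best(candidates, x):
    """Return the positive (alpha, beta) pair with the smallest SD_W."""
    valid = [(a, b) for a, b in candidates if a > 0 and b > 0]
    return min(valid, key=lambda ab: weighted_squared_deviation(ab[0], ab[1], x))
```

In use, candidates would hold one (alpha, beta) pair for each value of s from 2
through t. Note that the weight exp(beta i) grows with i, so deviations in the
most recent intervals count most heavily.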

The above criterion is applied to methods 2 and 3 only and compares the $SD_W$
within methods. If we wish to compare between methods 1, 2 and 3, unweighted
squared deviations are computed for each method, and the minimum value constitutes
the optimal result. These unweighted squared deviations are calculated between
forecasted errors and actual errors in intervals t+1 to T,

where T = some future time.                                            (Def. 19)

Finally, Schneidewind also discusses the error detection rate and error
correction rate in terms of $\alpha$'s, $\beta$'s and i's. A more complete explanation
of this material is given in [109]. The Schneidewind method 1 and method 3
(s=2) are both equivalent to the Geometric-Poisson model. A proof of this has
been included in section 1.6.6 which discusses the Geometric-Poisson model.

This  model  has  been  selected  for  further  study  and  will  be  applied  to 
data  in  order  to  better  evaluate  its  utility  in  section  1.7. 


1.6.5  Geometric  Model 

P. B. Moranda, a co-developer of the previously discussed de-eutrophication
model, has also developed a geometric model which exhibits many similarities to
the earlier model. However, certain variations are evident, among them being
that this particular model does not assume a fixed, finite number of errors nor
does it assume that each error has the same likelihood of detection. On the
contrary, the geometric model assumes that successive errors become harder to
find (which is likely to occur in a relatively mature piece of software).

Moranda contends that the assumption of a non-finite number of errors may be
realistic [79]: "There are those who believe that there are not a finite number
of  errors  in  a large  real  time  program;  certainly  this  is  so  if  there  is  an 
attempt  to  mirror  in  software  all  of  the  continuum  of  eventualities  which  occur 
in  complex  dynamic  situations."  This  same  paper  further  states,  "the  assumption 
that  all  errors  have  the  same  likelihood  of  detection  is  also  an  imperfect 
rendition  of  the  real  situation". 

The  following  assumptions  relate  to  the  geometric  model. 

(1)  There is an infinite number of total errors.

(2)  Errors do not have the same likelihood of detection.

(3)  The failure rate between successive errors forms a geometric
progression and is constant in the interval between errors.

(4)  Each error discovered is immediately removed, thus decreasing
the number of errors by one.

Essentially, this model is based on the assumption that the failure rate
between successive errors forms a geometric progression. The error rate in an
interval is proportional to the error rate before the detection of the last error
(see Figure 10). The model makes no assumptions about error generation, but
allows for the possibility, since it does assume that there are always
errors remaining (a very practical assumption for a programmer to make). If we
assume, as Belady and Lehman did [9], that at each error correction a number E
of errors is extracted and an amount G is generated, both of which are fractions
of the number of remaining errors at the time the error is corrected, then this
geometric model is appropriate. In essence, this model does not make any
assumptions about the mechanisms which cause errors; instead it models how the
errors appear to behave.

With the geometric model, the error detection rate is constant at its initial
value D until the first error is found. After the first error is detected, the
error detection rate becomes $DK$, after the second error, $DK^2$, after the third,
$DK^3$, and so forth, where K is a positive number less than one.

In general, after the $(i-1)$-st error has been detected, the hazard function is

    $Z(X_i) = D K^{i-1}$                                                (Eq. 1)

which is constant over the interval between the detection of the $(i-1)$-st and
$i$-th errors.

      $X_i$ = the $i$-th debugging interval                             (Def. 1)
      $D$ = the initial error detection rate                            (Def. 2)
      $K$ = a proportionality constant                                  (Def. 3)
      $i$ = the number of errors discovered after $i$ intervals         (Def. 4)


These parameters, D and K, are estimated via maximum likelihood and are derived
below. First, the probability density function (P.D.F.) of each interval $X_i$ is
given as:

    $p(X_i) = D K^{i-1} \exp(-D K^{i-1} X_i)$                            (Eq. 2)

and because the intervals are assumed to be statistically independent, the
likelihood function is:

    $L(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} D K^{i-1} \exp(-D K^{i-1} X_i)$   (Eq. 3)

where n = the total number of intervals (equals the total number of
          errors discovered).                                          (Def. 5)

Next, taking the log of the likelihood function yields

    $\log L = \sum_{i=1}^{n} \log (D K^{i-1}) - \sum_{i=1}^{n} D K^{i-1} X_i$      (Eq. 4)

    $\log L = n \log D + \sum_{i=1}^{n} \log K^{i-1} - \sum_{i=1}^{n} D K^{i-1} X_i$   (Eq. 5)

By taking the partial derivatives with respect to D and K we obtain

    $\frac{\partial \log L}{\partial D} = \frac{n}{D} - \sum_{i=1}^{n} K^{i-1} X_i = 0$   (Eq. 6)

    $\frac{\partial \log L}{\partial K} = \frac{1}{K} \sum_{i=1}^{n} (i-1) - D \sum_{i=1}^{n} (i-1) K^{i-2} X_i = 0$   (Eq. 7)

Solving equations 6 and 7 will give the following expression

    $\frac{\sum_{i=1}^{n} i K^{i} X_i}{\sum_{i=1}^{n} K^{i} X_i} = \frac{n+1}{2}$   (Eq. 8)

Now, if the above equation is solved for K, a solution for D may be found by

    $D = \frac{n}{\sum_{i=1}^{n} K^{i-1} X_i}$                           (Eq. 9)

The test data necessary to apply this model is the sequence of times
between errors (i.e. $X_1, X_2, X_3, \ldots, X_n$).

Finally, the mean time to failure (MTTF) and the reliability may be estimated by
equations 10 and 11 respectively.

    $MTTF = \frac{1}{D K^{n}}$                                           (Eq. 10)

    $R(X_i) = \exp[-D K^{n} X_i]$                                        (Eq. 11)

This  model  has  been  selected  for  further  study  in  section  1.7. 
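
To make the estimation steps for the geometric model concrete, the following
Python sketch solves the condition sum(i K^i X_i) / sum(K^i X_i) = (n+1)/2 for K by
bisection on the interval (0, 1) and then computes D = n / sum(K^(i-1) X_i). It is
an illustration only. The function names are hypothetical, and the MTTF helper
assumes that the mean time to the next failure is the reciprocal of the current
hazard rate D K^n, as implied by the exponential reliability function.

```python
def fit_geometric(x, iterations=200):
    """Return (D, K) from times between errors x[0..n-1]; None if no root in (0, 1)."""
    n = len(x)
    target = (n + 1) / 2.0

    def f(K):
        # Weighted mean of i with weights K**(i-1) * X_i, minus (n + 1)/2.
        # A common factor K has been cancelled from numerator and denominator.
        num = sum(i * K ** (i - 1) * xi for i, xi in enumerate(x, start=1))
        den = sum(K ** (i - 1) * xi for i, xi in enumerate(x, start=1))
        return num / den - target

    lo, hi = 1e-9, 1.0
    if f(hi) <= 0:              # intervals are not lengthening: no K < 1 fits
        return None
    for _ in range(iterations): # plain bisection on K
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    K = 0.5 * (lo + hi)
    D = n / sum(K ** (i - 1) * xi for i, xi in enumerate(x, start=1))
    return D, K


def mttf_after_n(D, K, n):
    """Mean time to the next failure once n errors have been removed."""
    return 1.0 / (D * K ** n)
```

A root with K < 1 exists only when the later intervals between errors tend to be
longer than the earlier ones, that is, when the data show reliability growth.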

1.6.6  Geometric-Poisson  Model

As we have seen before, software error data may also be classified as
failures per interval, and the nature of this error detection process suggests
the Poisson distribution as descriptive of the number of errors detected in a
fixed time period. Moranda discusses this extension in detail in reference [79]:
"Although there is, in reality, a continual purification which takes place within
a given time period, expediency requires that the detection rate be assumed
constant over time. If the purification is known to be only partially and, in the
best case, insignificantly, accomplished during the first time period, then there
is merit to the assumption of a constant rate. This of course, will be the
situation if the time period used is short relative to the total development time."

Under these conditions, the Poisson distribution with parameter $\lambda$ can be
used to describe the errors detected in the first time period. During the second
time period, the average number of errors detected is assumed to be proportional
to the average number detected in the first interval ($\lambda$), and so forth for
additional time periods (Figure 11). Thus, the average number of errors for each
successive time interval forms a geometric progression
($\lambda, \lambda K, \lambda K^2, \lambda K^3, \ldots$) and the hazard rate during the
$i$-th time interval is:

    $Z(t_i) = \lambda K^{i-1}$                                           (Eq. 1)

where $t_i$ = the $i$-th debugging interval                             (Def. 1)
      $\lambda$ = the initial detection rate                            (Def. 2)
      $K$ = a proportionality constant                                  (Def. 3)

Below  is  a complete  listing  of  the  assumptions  for  the  Geometric-Poisson  model. 

(1)  There  is  a non-finite  number  of  errors. 

(2)  Non-equal  likelihood  of  error  detection. 

(3)  During  a fixed  interval  of  time,  the  number  of  errors  detected 
follows  a Poisson  distribution. 

(4)  During each of these periods of time (data collection intervals)
the detection rate (parameter of the Poisson distribution)
is constant.

(5)  Data is available only at discrete intervals.

(6)  The detection rate in successive time intervals forms a
geometric progression.

(7)  Each error discovered is immediately removed or no longer counted.

The test data necessary for utilization of this model is the sequence of the
number of errors detected per time interval.

[Figure: error rate Z(T) versus time. The error rate is proportional to the error
rate in the previous interval.]


Following is a brief explanation of the derivation of the parameter
estimators. The probability density function (P.D.F.) is given as:

    $P(n_i) = \frac{(\lambda K^{i-1})^{n_i} \exp(-\lambda K^{i-1})}{n_i!}$          (Eq. 2)

Then, the likelihood function is:

    $L(n_1, n_2, \ldots, n_m) = \prod_{i=1}^{m} \frac{(\lambda K^{i-1})^{n_i} \exp(-\lambda K^{i-1})}{n_i!}$   (Eq. 3)

Taking the log of the likelihood function and then taking the partial derivatives
with respect to $\lambda$ and K yields the following equations:


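    $\frac{1}{\lambda} \sum_{i=1}^{m} n_i = \sum_{i=0}^{m-1} K^{i}$                (Eq. 4)

    $\lambda \sum_{i=0}^{m-1} i K^{i} = \sum_{i=0}^{m-1} i\, n_{i+1}$               (Eq. 5)

After some algebraic manipulation the following equation is obtained:

    $\frac{(1 - K^{m})(1 - K)}{K + (m-1) K^{m+1} - m K^{m}} = \frac{\sum_{i=1}^{m} n_i}{\sum_{i=0}^{m-1} i\, n_{i+1}}$   (Eq. 6)

Then,

    $\lambda = \frac{\sum_{i=1}^{m} n_i}{\sum_{i=0}^{m-1} K^{i}}$                   (Eq. 7)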

where m = the total number of time intervals                           (Def. 4)
      $n_i$ = the number of errors detected in the $i$-th debugging interval.  (Def. 5)

Note: Sukert, in [114], incorrectly defined $n_i$ as the cumulative number of errors
detected up through the $i$-th time interval.
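
The two estimating equations for the Geometric-Poisson model can be solved with a
short numerical routine. The Python sketch below is illustrative only. It works
with the ratio sum(i K^i) / sum(K^i) = sum(i n_{i+1}) / sum(n_i), with i running
from 0 to m-1, which is the reciprocal form of the condition on K, and then sets
lambda = sum(n_i) / sum(K^i). The function name and the use of bisection are our
own assumptions.

```python
def fit_geometric_poisson(counts, iterations=200):
    """Return (lam, K) from per-interval error counts n_1..n_m; None if no fit."""
    m = len(counts)
    total = sum(counts)
    # counts[i] is n_{i+1}, so this is sum(i * n_{i+1}) / sum(n_i).
    target = sum(i * c for i, c in enumerate(counts)) / total

    def f(K):
        powers = [K ** i for i in range(m)]
        return sum(i * p for i, p in enumerate(powers)) / sum(powers) - target

    lo, hi = 0.0, 1.0
    if target <= 0 or f(hi) <= 0:   # counts do not decay: no root in (0, 1)
        return None
    for _ in range(iterations):     # plain bisection on K
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    K = 0.5 * (lo + hi)
    lam = total / sum(K ** i for i in range(m))
    return lam, K
```

The routine needs only the sequence of error counts per interval, in keeping with
the test data requirement stated above.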

Now, referring back to the hazard rate for the Geometric-Poisson model (Eq. 1),
it can be shown that this model reduces to Method 1 of Schneidewind's model and
thus produces identical results. Below we will prove that the Geometric-Poisson
model is in fact equivalent to Schneidewind's model (Method 1).

After comparing numerical results from the two models, we discovered that:

    $K = e^{-\beta}$                                                     (Eq. 8)

Thus, the hazard function of the Geometric-Poisson model ($Z(t_i) = \lambda K^{i-1}$)
may be rewritten as:

    $Z(t_i) = \lambda e^{-\beta (i-1)}$                                   (Eq. 9)

Hence, if we equate $Z(t_i)$ (the hazard rate for the $i$-th time interval of the
Geometric-Poisson model) with $m_i$ (the estimated number of errors for interval $i$
of Schneidewind's Method 1) we have:

    $\lambda K^{i-1} = \chi \left[ e^{-\beta (i-1)} - e^{-\beta i} \right]$   (Eq. 10)

where $\chi = (\alpha/\beta)$

Using equation 8, equation 10 may be rewritten as:

    $\lambda e^{-\beta (i-1)} = \chi \left[ e^{-\beta (i-1)} - e^{-\beta i} \right]$   (Eq. 11)

Dividing through by $e^{-\beta (i-1)}$ yields

    $\lambda = \chi - \chi\, \frac{e^{-\beta i}}{e^{-\beta (i-1)}}$         (Eq. 12)

    $\lambda = \chi - \chi e^{-\beta}$                                     (Eq. 13)

    $\lambda = (\alpha/\beta) - (\alpha/\beta)(e^{-\beta})$                (Eq. 14)

    $\lambda = (\alpha/\beta) \left[ 1 - e^{-\beta} \right]$               (Eq. 15)

Thus, by substituting equations 8 and 15 in for K and $\lambda$ respectively we find
that the Geometric-Poisson hazard function (equation 1) is equivalent to $m_i$ of
the Schneidewind model, and equations 16 and 17 support this statement.

    $Z(t_i) = (\alpha/\beta)(1 - e^{-\beta})(e^{-\beta (i-1)})$            (Eq. 16)

    $Z(t_i) = (\alpha/\beta) \left[ e^{-\beta (i-1)} - e^{-\beta i} \right]$   (Eq. 17)

which is equivalent to

    $Z(t_i) = \lambda K^{i-1}$                                             (Eq. 18)


Of course we could prove this reduction working in the opposite direction. Since,

    $K^{i-1} = e^{-\beta (i-1)}$                                           (Eq. 19)

    $K = e^{-\beta}$                                                       (Eq. 20)

then

    $\beta = -\ln K$                                                       (Eq. 21)

Then, if we equate $Z(t_i)$ and $m_i$ from the Geometric-Poisson and Schneidewind
models respectively we have

    $\lambda K^{i-1} = (\alpha/\beta) \left[ e^{-\beta (i-1)} - e^{-\beta i} \right]$   (Eq. 22)

Substituting equation 21 into equation 22 yields

    $\lambda K^{i-1} = \left( \frac{\alpha}{-\ln K} \right) \left[ e^{-(-\ln K)(i-1)} - e^{-(-\ln K) i} \right]$   (Eq. 23)

    $\lambda K^{i-1} (-\ln K) = \alpha \left[ K^{i-1} - K^{i} \right]$       (Eq. 24)

    $\lambda K^{i-1} (-\ln K) = \alpha K^{i-1} \left[ 1 - K \right]$         (Eq. 25)

Now, if we divide both sides of equation 25 by $K^{i-1} [1-K]$ we have an expression
for $\alpha$:

    $\alpha = \frac{\lambda (-\ln K)}{1 - K}$                                (Eq. 26)

Thus, if we substitute our expressions for $\alpha$ and $\beta$ (equations 26 and 21)
into the right hand side of equation 22, we obtain:

    $m_i = \frac{\lambda (-\ln K) / (1 - K)}{-\ln K} \left[ e^{-(-\ln K)(i-1)} - e^{-(-\ln K) i} \right]$   (Eq. 27)

Cancelling $-\ln K$ and simplifying yields:

    $m_i = \frac{\lambda}{1 - K} \left[ e^{\ln (K^{i-1})} - e^{\ln (K^{i})} \right]$   (Eq. 28)

    $m_i = \frac{\lambda}{1 - K} \left[ K^{i-1} - K^{i} \right]$              (Eq. 29)

which is the same as

    $m_i = \frac{\lambda}{1 - K} \left[ K^{i} \left( \frac{1}{K} - 1 \right) \right]$   (Eq. 30)

    $m_i = \frac{\lambda}{1 - K} \left[ K^{i} \left( \frac{1 - K}{K} \right) \right]$   (Eq. 31)

Cancelling the factor $1 - K$ gives

    $m_i = \lambda \left[ (K^{i}) \left( \frac{1}{K} \right) \right]$          (Eq. 32)

    $m_i = \lambda \left[ (K^{i})(K^{-1}) \right]$                            (Eq. 33)

Hence,

    $Z(t_i) = \lambda K^{i-1}$                                               (Eq. 34)

    $m_i = Z(t_i)$                                                           (Eq. 35)

Since the Geometric-Poisson and the Schneidewind model (Method 1) are equivalent,
we can work with either model and obtain identical results. Because of the greater
flexibility of the Schneidewind model, we will work with it rather than the
Geometric-Poisson.
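
The equivalence can also be checked numerically. The Python sketch below is an
illustration with hypothetical function names. It converts between the two
parameterizations using K = exp(-beta) and lambda = (alpha/beta)(1 - exp(-beta)) in
one direction, and beta = -ln K and alpha = lambda (-ln K) / (1 - K) in the other,
and evaluates both models so that their interval-by-interval values can be
compared.

```python
import math


def schneidewind_mean(alpha, beta, i):
    """Predicted errors m_i in interval i for Schneidewind's Method 1."""
    return (alpha / beta) * (math.exp(-beta * (i - 1)) - math.exp(-beta * i))


def geometric_poisson_hazard(lam, K, i):
    """Hazard rate Z(t_i) of the Geometric-Poisson model."""
    return lam * K ** (i - 1)


def to_geometric_poisson(alpha, beta):
    """Map (alpha, beta) to (lam, K)."""
    return (alpha / beta) * (1.0 - math.exp(-beta)), math.exp(-beta)


def to_schneidewind(lam, K):
    """Map (lam, K) back to (alpha, beta)."""
    beta = -math.log(K)
    return lam * beta / (1.0 - K), beta
```

For any alpha > 0 and beta > 0, the two model functions should agree for every
interval i up to floating-point rounding, and the two conversions should invert
one another.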

1.6.7  Miyamoto's  Revised  Shooman  Model

In 1973 Isao Miyamoto attempted to expand the basic Shooman model to
include MacWilliams' work on input domains [72] and a factor representing errors
caused by erroneous debugging. In this model, error detection is a function of
the inputs to the system. The input space consists of all possible inputs.
During testing, the system is exposed to some subset of the input space; this
subset is the test space. Similarly each user will explore a particular portion
of what Miyamoto calls the "user space". Errors are triggered or revealed by
certain elements of the input space. These elements taken together comprise
what might be called an error space. Miyamoto assumes that the error space has a
finite initial size for any given program or system, and is expanded only by
erroneous debugging. Under this model errors are detected and subsequently
corrected wherever the test space overlaps the error space. If the correction
is successful, the error space is reduced. Miyamoto defines:

    $E_r(\tau) = [E_T + K(\tau)] - E_c(\tau)$                              (Eq. 1)

      $E_r(\tau)$ = remaining errors at time $\tau$                        (Def. 1)
      $E_T$ = number of initial errors in program                          (Def. 2)
      $K(\tau)$ = number of additional errors due to erroneous debugging   (Def. 3)
      $E_c(\tau)$ = errors corrected at time $\tau$                        (Def. 4)

When the test space and error space are disjoint, the system can be said to be
fully debugged for that test space. Clearly our concern must be with those inputs
which are in the error space and user space but outside the test space. As
exhaustive  testing  is  impossible  [13],  the  test  space  cannot  be  expanded  to 
include  all  of  the  user  space.  A great  deal  of  research  has  been  done  and 
reported  elsewhere  on  techniques  to  select  optimal  input  spaces  for  software 
testing  and  validation.  When  software  is  developed  it  is  generally  exposed  to 
a test  space  of  steadily  growing  size.  Periodically  new  inputs  are  added  and 
new  tests  are  devised.  The  rate  of  detection  of  new  errors  will  be  dependent 
on  the  rate  of  increase  in  the  size  of  the  test  space.  At  the  same  time,  if  the 
test  space  expands  in  the  right  "direction",  the  number  of  latent  errors  and  the 
hazard  rate  for  the  user  space  will  be  reduced.  The  relationship  between  the 
various  spaces  is  diagrammed  in  Figure  12. 

This  is  the  most  extensive  effort  made  to  date  to  examine  underlying 
mechanisms  in  software  reliability.  As  a conceptual  model  it  does  an  excellent 
job  of  describing  the  causes  of  the  software  behavior  that  the  analytical  models 
as  a class  attempt  to  predict.  Unfortunately,  before  a numerical  model  can  be 
constructed  several  questions  must  be  answered. 

(1)  If a model is made for the test space, how is reliability related
to the errors present? (By definition of the test space you have
detected all of the errors which are present, but how do the
uncorrected errors affect reliability; how often will you re-encounter
them?)

(2)  How  can  observations  made  in  the  test  space  be  used  to  predict 
behavior  in  the  user  space? 

(3)  How  will  the  reliability  estimate  be  affected  if  errors  in  the 
test  space  go  undetected,  particularly  errors  caused  by  erroneous 
debugging? 

By combining the solutions to those questions a model would be developed that
would relate error detection experience in the growing test space to the
population, or incidence, of errors in the user space and then project the effect
of those errors on user-encountered reliability. Unfortunately at this point no
model appears to be available based on good solutions.

Miyamoto presents a reliability model using the solutions outlined below:

Question (1):  How is reliability related to the errors present?

In [77] Miyamoto presents as Figure 9 (reproduced in Figure 13 in this paper) a
graph of error occurrence rate versus number of remaining errors, and states:

"This may allow [us] to conclude that the error rate is approximately
proportional to the number of observed errors remained [sic] uncorrected,
where the system operates stationary [sic] and the error occurrence is
assumed to be distributed uniformly."

Reading  the  data  off  Figure  13  we  calculate  a population  correlation 
coefficient  of  .4971.  This  is  a significant  relationship  only  if  we  accept  an 
alpha  or  Type  I risk  larger  than  10%. 

Question  (2):  How  can  observations  in  the  test  space  be  related  to 
behavior  in  the  user  space? 

Miyamoto  constructs  separate  models  for  the  test  space  and  the  input  space  both 
based  on  remaining  errors,  then  states: 

"The  difference  between  this  model  for  [the]  input  space  and  the 
model  for  test  space  is  that  the  remaining  errors  are  observable  only 
in  the  latter." 

No method is proposed to estimate $E_T$, the initial latent errors. When Miyamoto
applies his model to a large online system with six outstanding errors not yet
corrected and gets a MTTE of 396.5 hours, he is implicitly assuming that all
possible user inputs have been tested (test space = user space), that no other
errors exist, and that the linear relationship described in question 1 holds.
Clearly, this implies that once those last six errors are fixed the MTTE will be
infinite, and the system will never fail.


[Figure: error occurrence rate $\lambda$ versus number of remaining errors, with
fitted line $\lambda = 0.0094\, E_r(\tau) - 0.0187$.]


Thus, although Miyamoto has constructed a good conceptual model of the
software "de-eutrophication" process, we feel that his analytical model contains
a number of questionable assumptions that render it unsuitable for our purposes.
We suspect that this conceptual model may in fact be too detailed to serve as
the basis of an analytical model given the current state of the art in software
reliability research.

Variable Definition

A.  Error model for input space

      $E_T$ = initial number of latent errors in the program.              (Def. 5)
      $K(\tau)$ = additional errors caused by erroneous debugging.         (Def. 6)
      $E_r(\tau)$ = the number of remaining errors at time $\tau$.         (Def. 7)
      $E_c(\tau)$ = number of corrected (detected) errors at time $\tau$.  (Def. 8)
      $\tau$ = debugging time in calendar time.                            (Def. 9)
      $t$ = operational time.                                              (Def. 10)
      $C$ = constant of proportionality.                                   (Def. 11)

B.  Test space of growing size

      $E_o(\tau)$ = total number of observed errors caused by the new test jobs
                    at time $\tau$ (this error occurrence rate may be assumed
                    proportional to the test space growth rate) [i.e. the total
                    number of latent errors that will be observed in the "new"
                    test space].                                           (Def. 12)
      $E_c(\tau)$ = the total number of corrected errors.                  (Def. 13)
      $K_o(\tau)$ = the total number of observed additional errors because of
                    erroneous debugging at time $\tau$.                    (Def. 14)
      $E_{ro}(\tau)$ = total number of observed remaining errors at time $\tau$.  (Def. 15)

(Note: Since the test space is growing and since all errors in the test
space will be observed, $E_o(\tau)$ is the number of latent errors
contained in the test space which is defined at time $\tau$ and is
not the number of errors actually observed at time $\tau$.)

C.  Test space of fixed size

      $E_{To}$ = initial number of latent errors to be observed in the program
                 (i.e. $E_o(\tau) = E_{To}$, which is a constant).         (Def. 16)
      $K_o(\tau)$ = total number of observed additional errors caused by
                    erroneous debugging.                                   (Def. 17)
      $E_{ro}(\tau)$ = total number of observed remaining errors in the
                       fixed test space.                                   (Def. 18)


Required  Test  Data 

Sequence  of  times  between  errors  over  the  entire  test  space. 

Assumptions  of  Miyamoto's  Model 

(1)  The  errors  which  exist  in  the  "test"  space  will  be  actually  observed. 

(2)  The  number  of  errors  in  a program  (at  the  start  of  testing)  is  a 
constant  and  decreases  directly  as  errors  are  corrected. 

(3)  However,  erroneous  debugging  does  introduce  new  errors. 

(4)  The  failure  rate  is  proportional  to  the  number  of  residual  errors. 

(5)  It  is  implicitly  assumed  that  each  error  has  an  equal  chance  of 
being  detected. 

(6)  The error model developed for the "test space" roughly parallels the
appropriate model for "input space" (or at least, the "user space").

(7)  The  test  space  can  be  brought  to  coincide  with  the  user  space  in  order 
to  exhibit  a high  degree  of  software  reliability  to  the  user. 

The  "input  space"  is  the  set  of  all  possible  inputs  to  the  system. 

The  "test  space"  is  the  subset  of  the  input  space  which  can  be  "seen" 
by  the  testing  personnel  (i.e.  that  subset  of  the  input  space  actually 
used  during  testing). 

The  "user  space"  is  the  subset  of  the  input  space  which  is  utilized 
by  the  user.

Hazard Function

(1)  For input space the error model is:

$E_r(\tau) = [E_T + K(\tau)] - E_c(\tau)$  (Eq. 2)

Then, the hazard function is:

$\lambda(\tau) = C\{[E_T + K(\tau)] - E_c(\tau)\}$  (Eq. 3)

(2)  For test space of growing size the error model is:

$E_{ro}(\tau) = [E_o(\tau) + K_o(\tau)] - E_c(\tau)$  (Eq. 4)

Then, the hazard function is:

$\lambda(\tau) = C\{[E_o(\tau) + K_o(\tau)] - E_c(\tau)\}$  (Eq. 5)

(3)  For test space of fixed size the error model is:

$E_{ro}(\tau) = [E_{To} + K_o(\tau)] - E_c(\tau)$  (Eq. 6)

The corresponding hazard function is:

$\lambda(\tau) = C\{[E_{To} + K_o(\tau)] - E_c(\tau)\}$  (Eq. 7)

Reliability  Equations 

(1)  For a test space of growing size:

$R(t,\tau) = \exp(-C[E_{ro}(\tau)]t)$  (Eq. 8)

$= \exp(-C[E_o(\tau) + K_o(\tau) - E_c(\tau)]t)$  (Eq. 9)

(2)  For a test space of fixed size:

$R(t,\tau) = \exp(-C[E_{ro}(\tau)]t)$  (Eq. 10)

$= \exp(-C[\{E_{To} + K_o(\tau)\} - E_c(\tau)]t)$  (Eq. 11)

(3)  For input space:

$R(t,\tau) = \exp(-C\{E_r(\tau)\}t)$  (Eq. 12)

$= \exp(-C[\{E_T + K(\tau)\} - E_c(\tau)]t)$  (Eq. 13)

MTTE  (MTBSE  - Mean  Time  Between  Software  Errors) 

(1)  For a test space of growing size:

$\mathrm{MTBSE} = \frac{1}{C\{E_{ro}(\tau)\}}$  (Eq. 14)

$= \frac{1}{C[\{E_o(\tau) + K_o(\tau)\} - E_c(\tau)]}$  (Eq. 15)

(2)  For a test space of fixed size:

$\mathrm{MTBSE} = \frac{1}{C\{E_{ro}(\tau)\}}$  (Eq. 16)

$= \frac{1}{C[\{E_{To} + K_o(\tau)\} - E_c(\tau)]}$  (Eq. 17)

(3)  For input space:

$\mathrm{MTBSE} = \frac{1}{C\{E_r(\tau)\}}$  (Eq. 18)

$= \frac{1}{C[\{E_T + K(\tau)\} - E_c(\tau)]}$  (Eq. 19)
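
The relationships above for the input space can be illustrated with a short
numerical sketch. It is only an illustration: it assumes that the hazard is the
constant C times the number of remaining errors, as assumption (4) states, and
the function names and sample values are hypothetical, not taken from
Miyamoto's work.

```python
import math


def remaining_errors(initial_errors, introduced_errors, corrected_errors):
    """E_r = [E_T + K] - E_c, the error model for the input space."""
    return initial_errors + introduced_errors - corrected_errors


def hazard(c, remaining):
    """Hazard taken as proportional to the number of remaining errors."""
    return c * remaining


def reliability(c, remaining, t):
    """R(t, tau) = exp(-C * E_r(tau) * t) for operational time t."""
    return math.exp(-c * remaining * t)


def mtbse(c, remaining):
    """Mean time between software errors, 1 / (C * E_r(tau))."""
    return 1.0 / (c * remaining)


if __name__ == "__main__":
    # Hypothetical values: 100 latent errors, 10 introduced, 60 corrected.
    e_r = remaining_errors(100, 10, 60)
    print(e_r, hazard(0.002, e_r), mtbse(0.002, e_r), reliability(0.002, e_r, 10.0))
```

The same functions apply to the two test-space cases by passing the observed
quantities in place of the input-space ones.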


1.6.8  Manpower  Limited  Model 

The Manpower Limited model proposed by Shooman and Natarajan in [113] is
an attempt to relax the assumption that the number of errors in a program remains
constant. The hazard function, equal to the error detection rate, is an input to
this model and is assumed to be constant. Instead, this model is formulated to
yield an expression for $n(\tau)$, the number of errors remaining in the software. The
authors commence by developing a difference equation for the number of
errors in the program:

Errors present at time $\tau_i$ = [errors present at time $\tau_{i-1}$]  (Eq. 1)

    + [errors generated in the interval ($\tau_i - \tau_{i-1}$)]

    - [errors removed in the interval ($\tau_i - \tau_{i-1}$)]

If we let:

(1)  $n_g(\tau_i, \tau_{i-1})$ = number of errors generated in the interval ($\tau_i - \tau_{i-1}$)  (Def. 1)

(2)  $n_d(\tau_i, \tau_{i-1})$ = the number of errors detected in the interval ($\tau_i - \tau_{i-1}$)  (Def. 2)

(3)  $n_c(\tau_i, \tau_{i-1})$ = the number of errors corrected in the interval ($\tau_i - \tau_{i-1}$)  (Def. 3)

then the above difference equation may be written symbolically as:

$n(\tau_i) = n(\tau_{i-1}) + n_g(\tau_i, \tau_{i-1}) - n_c(\tau_i, \tau_{i-1})$  (Eq. 2)

This difference equation is then transformed into a differential equation by
grouping terms, dividing both sides by $(\tau_i - \tau_{i-1}) = \Delta\tau$ and then taking limits,
resulting in:

$\frac{dn(\tau)}{d\tau} = r_g(\tau) - r_c(\tau)$  (Eq. 3)

where

$r_g(\tau_i)$ = the generation rate of new errors in time $\tau_i$  (Def. 4)

$r_c(\tau_i)$ = the correction rate of new errors at time $\tau_i$  (Def. 5)

$r_d(\tau_i)$ = the detection rate of errors at time $\tau_i$.  (Def. 6)

This  differential  equation  can  then  be  solved  for  n if  the  appropriate 
rates  are  known. 

The  basic  assumptions  associated  with  the  Manpower  Limited  model  are  listed 
below  with  a short  discussion  following. 

(1)  There  is  a finite  number  of  initial  latent  errors  in  a program. 

(2)  New  errors  may  be  generated  during  debugging. 

(3)  The  error  correction  rate  remains  constant  during  the  early  stages 
of  debugging,  but  later  decreases,  proportional  to  the  number  of 
remaining  errors. 

(4)  The error generation rate is proportional to the product of the
number of remaining errors and the rate of detected errors.

(5)  The  error  detection  rate  is  constant. 
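
A rough numerical sketch of how the listed assumptions combine with the
differential equation dn/dtau = r_g - r_c is given below. It is not the
authors' solution. The constants `a`, `b` and `r_c_max`, and the use of
`min()` to switch from a manpower-limited (constant) correction rate to one
proportional to the remaining errors, are assumptions made for illustration
only.

```python
def simulate_remaining_errors(n0, r_d, a, b, r_c_max, t_end, dt=0.01):
    """Forward-Euler integration of dn/dtau = r_g(tau) - r_c(tau).

    n0       initial number of latent errors (assumption 1)
    r_d      constant error detection rate (assumption 5)
    a        hypothetical constant: r_g = a * n * r_d (assumption 4)
    b        hypothetical constant: late-stage r_c = b * n (assumption 3)
    r_c_max  hypothetical manpower-limited ceiling on the correction rate
    """
    n = float(n0)
    t = 0.0
    history = [(t, n)]
    for _ in range(int(round(t_end / dt))):
        r_g = a * n * r_d            # generation rate of new errors
        r_c = min(r_c_max, b * n)    # constant early, proportional to n later
        n = max(0.0, n + (r_g - r_c) * dt)
        t += dt
        history.append((t, n))
    return history


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise both correction regimes.
    trace = simulate_remaining_errors(100, 2.0, 0.001, 0.1, 5.0, 50.0)
    print(trace[0], trace[-1])
```

With these inputs the correction rate is capped at `r_c_max` until fewer than
50 errors remain, after which the remaining-error count decays roughly
exponentially.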

The major assumption of this model is that the correction rate [$r_c(\tau)$]
remains constant during the early stages of debugging (manpower limited) and
decreases, proportional to the number of remaining errors, in later stages. This
assumption was made in an attempt to reflect real-world practice and does seem
reasonable. During the early stages of testing the manpower for debugging is
limited, and the number of errors is large enough to keep the debugging staff
constantly employed, thus producing a constant correction rate. Later on, however,
fewer errors appear, so the debugging manpower may be cut back. Even if this is
not the case, we intuitively feel that later-occurring bugs will be harder to fix,
resulting in a decreasing correction rate.


Although we feel that the aforementioned assumption is realistic, we do
have qualms about the assumptions concerning the error detection and generation
rates. The error detection rate is assumed to be constant for the entire program,
thus severely limiting the applicability of the model. It appears that this
assumption was one of convenience, since one would expect the detection rate to
decrease with time. The second questionable assumption is that the error
generation rate is proportional to the product of the number of remaining errors
and the rate of error detection. We see no convincing reason why it should be
proportional to the number of remaining errors, and we question its relation to
the detection rate. In short, we disagree with this assumption, as it does not seem
to reflect reality. Finally, we feel that this model is not appropriate for
reliability estimation, primarily because the failure rate is required as an input.

Although the Manpower Limited model is not applicable in our research effort,
a thorough discussion of this model and its requirements may be found on pages
155-170 of reference [113].

1.6.9  Musa Model

John D. Musa of Bell Laboratories has developed a model, based in part on
the Shooman and Jelinski-Moranda models, which is described in [85]. Recently he
has also made available the user's guide [8^] and programming manual [83] for a
computer program which implements his model. The Musa model is organized as two
sections, one dealing with reliability as a function of execution time and the
other relating execution time to calendar time.

The execution time component of the model is based on assumptions very
similar to those of the Jelinski-Moranda de-eutrophication model and the Shooman
exponential model. The resulting models, like the other two mentioned, have hazard
functions that are linear functions of remaining errors.

Musa defines in [85]:

$T = T_0 \exp\left(\frac{C\tau}{M_0 T_0}\right)$  (Eq. 1)

$T_0 = \frac{1}{fKN_0}$  (Eq. 2)

$M_0 = N_0/B$  (Eq. 3)

$m = M_0[1 - \exp(-BCfK\tau)]$  (Eq. 4)

where

$T$ = mean time between failures  (Def. 1)

$T_0$ = MTBF at start of testing  (Def. 2)

$C$ = testing compression factor  (Def. 3)

$\tau$ = execution time to date  (Def. 4)

$M_0$ = number of failures required to expose all $N_0$  (Def. 5)

$f$ = linear execution frequency  (Def. 6)

$K$ = error exposure ratio  (Def. 7)

$B$ = error reduction factor  (Def. 8)

$N_0$ = number of inherent errors  (Def. 9)

$m$ = errors found to date  (Def. 10)

By substituting equations 2 and 3 into equation 1 we get:

$T = T_0 \exp(BCfK\tau)$  (Eq. 5)

Equation 4 can be rewritten as:

$\exp(BCfK\tau) = M_0/(M_0 - m)$  (Eq. 6)

Substituting equation 6 into equation 5 gives:

$T = \frac{T_0 M_0}{M_0 - m}$  (Eq. 7)

If we let

$\phi = \frac{1}{T_0 M_0} = BfK$ (a constant)  (Eq. 8)

$N = M_0$ (initial errors)  (Eq. 9)

$i - 1 = m$ (errors to date)  (Eq. 10)

Equation 7 becomes:

$T = \frac{1}{\phi[N - (i-1)]}$  (Eq. 11)

This is the MTBF expression for the Jelinski-Moranda de-eutrophication model.

Working the other way we find:

$T_0 = \frac{1}{\phi N}$  (Eq. 12)

$M_0 = N$  (Eq. 13)

$m = i - 1$  (Eq. 14)
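
The equivalence derived above can be checked numerically. The sketch below
assumes the exponential forms of Musa's MTBF and failures-to-date expressions
used in this derivation. The function names and parameter values are
hypothetical and chosen only for illustration.

```python
import math


def musa_mtbf(tau, t0, m0, c):
    """Musa MTBF after execution time tau: T = T0 * exp(C*tau / (M0*T0))."""
    return t0 * math.exp(c * tau / (m0 * t0))


def musa_failures_to_date(tau, t0, m0, c):
    """Failures experienced by execution time tau: m = M0 * (1 - exp(-C*tau/(M0*T0)))."""
    return m0 * (1.0 - math.exp(-c * tau / (m0 * t0)))


def jelinski_moranda_mtbf(phi, n, i):
    """Jelinski-Moranda MTBF before the i-th failure: 1 / (phi * (N - (i - 1)))."""
    return 1.0 / (phi * (n - (i - 1)))


if __name__ == "__main__":
    # Hypothetical constants: f, K, N0, B, C as defined in the text.
    f, k, n0, b, c = 2.0, 0.005, 100.0, 0.8, 2.0
    t0 = 1.0 / (f * k * n0)      # MTBF at start of testing
    m0 = n0 / b                  # failures needed to expose all N0
    tau = 30.0
    m = musa_failures_to_date(tau, t0, m0, c)
    phi = 1.0 / (t0 * m0)        # equals B*f*K
    print(musa_mtbf(tau, t0, m0, c), jelinski_moranda_mtbf(phi, m0, m + 1.0))
```

The two printed values should agree to rounding error, because m is treated here
as a continuous quantity and i - 1 is set equal to it.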


While the Musa model is equivalent to the Jelinski-Moranda model, it is
much harder to work with. The model uses 9 different constants, most of which
Musa says cannot be computed and must be estimated from "similar programs", then
re-estimated periodically by maximum likelihood. For example, it may be noted
that in the above derivation $B$, $C$, $f$, $K$, $T_0$, $M_0$, $N_0$, and $m$ are all various
constant factors, rates, and initial values.

Musa  goes  beyond  this  basic  model  to  incorporate  an  "error  reduction 
factor",  making  the  error  correction  rate  a linear  function  of  the  detection  rate. 
Using  this  he  then  derives  estimates  of  the  number  of  failures  needed  to  detect 
all  remaining  errors  and  the  execution  time  required. 

We feel, however, that this assumption (that effectively a fixed fraction of
the errors are corrected before testing resumes, and the remainder must be detected
again before they can be corrected) is most inaccurate, making the derivation of
remaining failures and time highly suspect. It would be far more realistic to
assume a definite time lag between detection and correction. Also, we feel that
making the correction rate proportional to the failure rate is far less realistic
than it could be. We prefer the correction rate of the Manpower Limited model,
which is initially constant due to manpower limitations and later decreasing,
proportional to the number of remaining errors. This implies that the last bugs
found are the most subtle, and so harder to locate and fix correctly.


The second component of the Musa model, calendar time, is where its great
value lies. Musa feels that execution time is the best scale to use with software,
and develops the basic model using that. Musa then recognized that the pace of
testing is constrained by limits on three resources: failure identification
personnel, failure correction personnel, and computer time. The calendar time
component of the Musa model defines limits for these resources, relates execution
time to the use of these resources, then estimates how much calendar time it will
take to supply the needed resources.

Because  of  the  similarity  between  the  Jelinski-Moranda  and  Musa  models,  we 
have  chosen  to  pursue  our  current  investigation  of  error  detection  models  using 
the  simpler  Jelinski-Moranda  model. 

Musa  Model 

Execution  Time  Component 

C = testing  compression  factor 

B = error  reduction  factor 

$T_0$ = MTBF at start of testing

$\tau$ = execution time of program

$N_0$ = number of inherent (existing before the test phase) errors in
the program

$N$ = number of remaining errors = $N_0 - n$ (a function of $\tau$)

$n$ = net number of errors corrected (a function of $\tau$)

$f$ = linear execution frequency (average instruction execution rate
divided by the number of instructions in the program)

$K$ = error exposure ratio

— Required  test  data 

Sequence  of  times  to  failure 

— Estimation  of  parameters 

Previous  data,  similar  projects  (K)  n "computed  from  collected  data" 

o 

--  Assumptions  of  the  Musa  model 

(1)  Any errors in the program are independent of each other and are
distributed at any time with a constant average occurrence rate.

(2)  Thus, the number of errors in a given time interval has a Poisson
distribution (whose parameter changes whenever an error is corrected).

(3)  Types of instructions are reasonably well mixed, and execution time
between failures is large compared to average instruction time
(implicit in most models).

(4)  The potential "test space" for the program covers its "use space".

(5)  All  failures  are  observed. 

(6)  The  error  causing  each  failure  is  fixed  immediately  or  it  is  not 
recounted  (Musa  tends  to  assume  implicitly  that  all  errors  are 
immediately  corrected). 

(7)  Initially  assumes  that  no  new  errors  are  generated  during  debugging. 

(8)  Later  assumes  that  only  a fixed  percentage  of  detected  errors  are 
corrected.  Assumption  (6)  still  applies. 

— Hazard  Function 


$Z(\tau) = KfN$, where $N$ is the number of remaining errors


— MTBF 


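(1)  MTBF:  $T = T_0 \exp\left(\frac{C\tau}{M_0 T_0}\right)$

(2)  If initial MTBF  $T_0 = \frac{1}{fKN_0}$,

(3)  Then, equation (1) becomes  $T = \frac{1}{fKN_0} \exp\left(\frac{CfKN_0\tau}{M_0}\right)$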


— Reliability  Equation 


$R(\tau,\tau') = \exp\left[-\int_0^{\tau'} Z(\tau)\,dx\right] = \exp[-\tau' Z(\tau)]$

$R(\tau,\tau') = \exp\left(-\frac{\tau'}{T}\right)$


Musa  Model 


Calendar  Time  Component 


Variable  definition 


$M_0$ = number of failures required to expose all $N_0$

$N_0$ = number of inherent errors

$T_0$ = MTBF at start of testing

$C$ = testing compression factor (non-overlap of successive tests)

For the following, $K$ = C, F, I for computer time, failure correction personnel,
and failure identification personnel, respectively.

$\Delta t_K$ = calendar time required for this resource

$P_K$ = available units of the resource (people, shifts)

$\mu_K$ = amount of time this resource is used per failure

$\rho_K$ = utilization of this resource (e.g. fraction of day actually
spent working)

$\theta_K$ = amount of calendar time this resource is used per unit of
execution time



1.6.10  Bayesian  Reliability  Growth  Model 


The  basic  assumptions  of  this  model  are: 

(1)  Does not assume a finite number of errors.

(2)  Failure  rate  between  errors  is  constant  (i.e.  time  between  failures 
follows  an  exponential  distribution). 

(3)  Error  correction  is  not  always  successful  (i.e.  the  programmer  intends 
to  decrease  the  failure  rate  in  an  attempted  correction,  but  the 
correction  may  not  be  successful). 

(4)  Failure  rate  declines  probabilistically  as  correction  is  attempted. 

(5)  Failure  rates  follow  a Gamma  distribution. 

Notation: 

$\lambda(i)$ = failure rate in the interval between the detection
of the $(i-1)^{st}$ and $i^{th}$ errors  (Def. 1)

$g(\ell \mid i, \alpha)$ = Gamma probability density function of $\lambda(i)$  (Def. 2)

$\alpha$ = a parameter of the Gamma distribution of $\lambda(i)$  (Def. 3)

$\psi(i)$ = a growth parameter of the Gamma distributions of $\lambda(i)$  (Def. 4)

The so-called Bayesian Reliability Growth model proposed by Littlewood and
Verrall [67] attempts to account for error generation by creating a repair rule
which as nearly as possible reproduces the effect of the programmer's corrective
action on the program. Given a failure rate $\lambda(i-1)$ as the failure rate between
the detection of the $(i-1)^{st}$ and $i^{th}$ failures, the programmer intends, when
carrying out a repair to the $i^{th}$ failure, to make the program more reliable than
it was before. In other words, the programmer intends by his repair action to
diminish $\lambda$, making $\lambda(i) < \lambda(i-1)$ for all $i$. However, he cannot be sure that the
failure rate has diminished; instead, it is argued that it is probable that this
has happened, i.e.

$P(\lambda(i) < \ell) > P(\lambda(i-1) < \ell)$ for all $i$, $\ell$.  (Eq. 1)

This, in essence, is a resort to the Bayesian technique of allowing the
failure rate parameter, $\lambda(i)$, to have a probability distribution. The probability
density function of $\lambda(i)$ is denoted by $g(\ell \mid i, \alpha)$ where $\alpha$ is a parameter or vector
of parameters.

The authors propose that failures be considered as a random (Poisson) process,
that is, between failures (or within small time intervals) the failure rate is
constant. This in turn leads to an exponential density function for the failure
times.

$f(t \mid \lambda) = \begin{cases} \lambda e^{-\lambda t}, & t > 0,\ \lambda > 0 \\ 0, & t < 0,\ \lambda > 0 \end{cases}$  (Eq. 2)

This is a popular and reasonable assumption and is perhaps the most tractable
mathematically.

For the failure rates themselves, the authors select, as a "suitable
parametric family with a monotonically arranged distribution function", a family
of Gamma distributions,

$g(\ell \mid i, \alpha) = \begin{cases} \dfrac{\psi(i)[\psi(i)\ell]^{\alpha-1} e^{-\psi(i)\ell}}{\Gamma(\alpha)}, & \ell > 0 \\ 0, & \ell \le 0 \end{cases}$  (Eq. 3)

where $\psi(i)$ is a scaling or "growth of reliability" factor which must be a
monotonically increasing function of $i$. The fact that $\psi(i)$ is monotonically increasing
with $i$ guarantees that

$G(\ell \mid i, \alpha) > G(\ell \mid (i-1), \alpha)$ for all $i$, $\ell$  (Eq. 4)

where $G(\ell \mid i, \alpha)$ is the distribution function of $\lambda(i)$  (Def. 5)

i.e.  $G(\ell \mid i, \alpha) = P(\lambda(i) < \ell)$.  (Eq. 5)

Hence, this guarantees that

$P(\lambda(i) < \ell) > P(\lambda(i-1) < \ell)$ for all $i$, $\ell$  (Eq. 6)

In other words, the nature of $\psi(i)$ represents the programmer's intention, but
not certainty, of improving the program. This choice of a family of Gamma
distributions for the failure rates, the authors contend, is justifiable by its
flexibility (having two parameters, $\alpha$ and $\psi(i)$), correct range $(0, \infty)$, and
mathematical tractability.
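
The ordering of the distribution functions that a monotonically increasing
growth factor produces can be checked numerically. The sketch below is only an
illustration under two assumptions that are not part of the model description
above: the shape parameter is taken to be a positive integer, so that the Gamma
distribution function has a closed (Erlang) form, and the growth factor is given
a hypothetical linear form b0 + b1*i.

```python
import math


def gamma_cdf_integer_shape(ell, shape, rate):
    """P(X < ell) for a Gamma variable with integer shape and the given rate (Erlang form)."""
    if ell <= 0:
        return 0.0
    x = rate * ell
    partial_sum = sum(x ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-x) * partial_sum


def growth_factor(i, b0=1.0, b1=0.5):
    """Hypothetical monotonically increasing growth factor psi(i) = b0 + b1*i."""
    return b0 + b1 * i


def failure_rate_cdf(ell, i, alpha):
    """G(ell | i, alpha): distribution function of the i-th failure rate."""
    return gamma_cdf_integer_shape(ell, alpha, growth_factor(i))


if __name__ == "__main__":
    for i in range(1, 6):
        print(i, failure_rate_cdf(0.5, i, 2))
```

The printed values increase with i: after each repair, a small failure rate
becomes more probable, although no individual repair is guaranteed to help.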
Giving $\lambda$ this prior (Gamma) distribution is a standard practice in a Bayesian
analysis. The prior distribution is supposed to be a reflection of the prior
(i.e. before data is collected) beliefs of the experimenter. After data is collected,
these beliefs are modified by the data, and the result is given in the posterior
distribution. However, the authors deviate from a usual Bayesian analysis when they
also give $\alpha$ a prior distribution. Putting the prior distribution on $\alpha$ seems to
make the model a great deal more complicated and very difficult to analyze.
Furthermore, the prior distribution on $\alpha$ is assumed to be a uniform distribution, and no
justification is given for this assumption. Arbitrarily assigning uniform priors
to parameters is one of the more criticized techniques in early Bayesian works.
The point is that it would be extremely difficult to determine one's true prior
distribution for $\alpha$ since it is another step removed from the variable of interest, $\lambda$.

There is also a major difficulty in estimating the other parameter of the
Gamma distribution, namely $\psi(i)$. The authors do consider some estimation methods
for $\psi(i)$ but admit that it does present some problems.

The authors do not attempt to apply the model to any real data, which is
disappointing. It should also be noted that this model was proposed in 1973, and
to date, no follow-up work on it has appeared in the literature. While the model
does have intuitive appeal and appears applicable to many realistic situations, the
difficulty in estimating $\alpha$ and $\psi(i)$ is too great for the model to be useful in our
analyses.


1.6.11  Trivedi and Shooman Markov Model

The  basic  model  (Model  I)  has  the  following  assumptions: 

(1)  Software  system  is  large  ( - 10 ^ words  of  code). 

(2)  The  system  initially  contains  an  unknown  number,  n,  of  unknown  bugs. 

(3)  At  most  one  error  is  discovered  at  a given  time. 

(4)  Each  error  is  corrected  before  the  next  error  occurs. 

(5)  Error  detection  and  correction  occur  alternately  and  sequentially. 


(6)  The  probability  of  transition  from  state  i to  state  j depends  only 

on  those  states  and  is  completely  independent  of  all  past  states  except 
the  last  one  (Markov  assumption). 

(7)  The  failure  rate  depends  upon  the  number  of  software  bugs  remaining 
in  or  removed  from  the  system. 

(8)  In  the  interval  of  time  between  error  occurrences  the  failure  rate 
is  constant. 

(9)  The  attempted  error  correction  is  always  successful  in  reducing  the 
number  of  errors  by  one. 

(10)  There  are  two  types  of  states  "up"  or  "down". 


Notation:

$n$ = the number of initial errors in the system  (Def. 1)

$P_{ij}$ = transition probability from state $i$ to state $j$  (Def. 2)

$(n, n-1, n-2, \ldots)$ = sequence of "up" states  (Def. 3)

$(m, m-1, m-2, \ldots)$ = sequence of "down" states  (Def. 4)

$(\ell, \ell-1, \ell-2, \ldots)$ = sequence of "non critically down" states  (Def. 5)

$\lambda_{n-k}$ = error detection rate in state $(n-k)$  (Def. 6)

$\mu_{m-k}$ = error correction rate in state $(m-k)$  (Def. 7)

$\alpha_{m-k}$ = rate of unsuccessful correction and no new errors introduced  (Def. 8)

$\beta_{m-k}$ = rate of unsuccessful correction with one new error introduced  (Def. 9)

$\gamma_{n-k}$ = non critical failure rate  (Def. 10)

$\omega_{\ell-k}$ = rate of unsuccessful correction of a non critical failure
which introduces a critical failure  (Def. 11)

$P_{n-k}(t)$ = probability of being in state $(n-k)$ at time $t$  (Def. 12)

$P_{m-k}(t)$ = probability of being in state $(m-k)$ at time $t$  (Def. 13)

$A(t)$ = availability of system at time $t$  (Def. 14)



Trivedi and Shooman [130-132] have proposed a many-state Markov model for
the estimation and prediction of various performance parameters of software. This
model depicts the process by which error detection and correction occur, and is
used to predict reliability, availability and the number of errors that will have
been corrected at a future time. The model makes no attempt to estimate or
describe the error detection rate or the error correction rate. On the contrary,
these are required as inputs to the model.

Trivedi and Shooman begin by assuming that the software system is large and
contains a fixed number, $n$, of errors initially (at time $t = 0$). The most meaningful
time origin for the model would be the start of Phase II, where the program runs,
at least briefly, between errors. It is further assumed that errors occur and are
corrected alternately and sequentially. In the basic version of the model, all
errors are corrected successfully.

The software system can be in either of two states: "up" or "down". The
system is in an "up" state if no error has occurred or if an error has just been
repaired (i.e. the system is operable). The sequence of "up" states is denoted by
$(n, n-1, n-2, \ldots)$ where the state number can be thought of as the number of errors
remaining in the system. Similarly, a "down" state is one in which an error has
occurred and is being corrected, and the sequence of down states is denoted by
$(m, m-1, m-2, \ldots)$. In general, the system will be in state $(n-k)$ if the $(k-1)^{st}$
error has been corrected and the $k^{th}$ error has not yet occurred, while the system
will be in state $(m-k)$ after the $k^{th}$ error has been discovered but not yet corrected
($k = 0, 1, 2, \ldots, n$). The error occurrence rate while in state $(n-k)$ is denoted $\lambda_{n-k}$ and
is a function of the number of errors remaining in the system. Similarly, the
error correction rate while in state $(m-k)$ is $\mu_{m-k}$.

Trivedi and Shooman next invoke the basic Markov assumption that the probability
of transition from state $i$ to state $j$ is dependent only on those states and is
independent of all past states except the last one. The probability of transition from
state $(n-k)$ to state $(m-k)$ is

$P_{n-k,\,m-k} = \lambda_{n-k}\Delta t$  (Eq. 1)

Similarly, the transition probability from state $(m-k)$ to state $(n-k-1)$ is

$P_{m-k,\,n-k-1} = \mu_{m-k}\Delta t$  (See Figure 14)  (Eq. 2)

The entire set of these transition probabilities defines a discrete-state,
continuous time Markov process.

If we now let $P_i(t)$ be the probability that the system is in state $i$ at
time $t$, then the availability of the system is simply the probability that the
system is in any "up" state at $t$. That is

$A(t) = \sum_{k=0}^{n} P_{n-k}(t)$  (Eq. 3)

In order to determine these "state occupancy probabilities", $P_i(t)$, we must solve
the following system of differential equations (which can be developed by inspection
of Figure 14):

$\dot{P}_n(t) = -\lambda_n P_n(t)$  (Eq. 4)

$\dot{P}_{n-k}(t) + \lambda_{n-k} P_{n-k}(t) = \mu_{m-k+1} P_{m-k+1}(t)$ ;  $k = 1, 2, \ldots, n$  (Eq. 5)

$\dot{P}_{m-k}(t) + \mu_{m-k} P_{m-k}(t) = \lambda_{n-k} P_{n-k}(t)$ ;  $k = 0, 1, 2, \ldots, n$  (Eq. 6)

subject to these conditions:

$P_n(0) = 1$ ;  $P_m(0) = 0$  (Eq. 7)

$P_{n-k}(0) = 0$ ;  $k = 1, 2, \ldots, n$  (Eq. 8)

$P_{m-k}(0) = 0$ ;  $k = 1, 2, \ldots, n$  (Eq. 9)

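
A minimal numerical sketch of the basic model is given below. It integrates
the state occupancy equations with a simple forward-Euler step and sums the
"up" probabilities to obtain the availability. This is an illustration only,
not Trivedi and Shooman's own solution technique. The function names, the
step size, and the sample rates (a detection rate that falls as errors are
removed and a constant correction rate) are all assumptions.

```python
def simulate_model_one(lam, mu, t_end, dt=0.01):
    """Forward-Euler solution of the basic (Model I) occupancy equations.

    lam[k] is the error occurrence rate in up state (n-k),
    mu[k]  is the error correction rate in down state (m-k), k = 0..n.
    Returns (up, down), where up[k] = P_{n-k}(t_end) and down[k] = P_{m-k}(t_end).
    """
    n = len(lam) - 1
    up = [1.0] + [0.0] * n        # the system starts in state n with certainty
    down = [0.0] * (n + 1)
    for _ in range(int(round(t_end / dt))):
        d_up = [0.0] * (n + 1)
        d_down = [0.0] * (n + 1)
        for k in range(n + 1):
            # Up state (n-k) is fed by repairs completed in down state (m-k+1).
            inflow = mu[k - 1] * down[k - 1] if k >= 1 else 0.0
            d_up[k] = inflow - lam[k] * up[k]
            # Down state (m-k) is fed by errors occurring in up state (n-k).
            d_down[k] = lam[k] * up[k] - mu[k] * down[k]
        up = [p + dt * d for p, d in zip(up, d_up)]
        down = [p + dt * d for p, d in zip(down, d_down)]
    return up, down


def availability(up):
    """A(t): probability of being in any up state."""
    return sum(up)


if __name__ == "__main__":
    # Hypothetical rates for n = 3 initial errors; the last up state is error free.
    lam = [0.3, 0.2, 0.1, 0.0]
    mu = [1.0, 1.0, 1.0, 1.0]
    up, down = simulate_model_one(lam, mu, t_end=10.0)
    print(availability(up), up, down)
```

A production implementation would more likely use an adaptive ODE solver or the
analytical solution. The fixed-step loop is used here only to keep the example
self-contained.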

Trivedi and Shooman next present techniques for the exact analytical solution
and the numerical solution of this system. The reliability of the software system,
of course, depends on the stage of debugging of the system (i.e., the number of
bugs remaining in the system). After the $k^{th}$ bug has been removed and the system
is in state $(n-k)$, the error occurrence rate is $\lambda_{n-k}$, a function of $k$, and is
constant. Thus, the reliability will be

$R_k(\tau) = \exp(-\lambda_{n-k}\tau)$  (Eq. 10)

where $\tau$ is the time since the correction of the $k^{th}$ error. Hence the expected
value of the reliability at some future time $t$ and a duration $\tau$ is:

$R(t,\tau) = \sum_{k=0}^{n} \exp(-\lambda_{n-k}\tau)\,P_{n-k}(t)$  (Eq. 11)


This Markov model, then, provides estimates of the state occupancy
probabilities which, in turn, can be used to estimate and predict the most probable
number of errors that will have been corrected at any time $t$, based on preliminary
modeling of the error occurrence and repair rates, $\lambda$ and $\mu$. Equivalently, this
means that we can estimate the most probable state(s) of the system at some future
time $t$. Given this information, we can then predict the availability, $A(t)$, and
the reliability $R(t,\tau)$, for a given duration $\tau$ at a future time $t$.
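
As an illustration of this last step, the sketch below computes an expected
reliability from a set of state occupancy probabilities. It assumes that the
expected reliability is the occupancy-weighted sum of the per-state
reliabilities exp(-lambda*tau). The function name and the sample numbers are
hypothetical.

```python
import math


def expected_reliability(up_probabilities, occurrence_rates, duration):
    """Occupancy-weighted reliability over the up states.

    up_probabilities[k] is the probability of being in up state (n-k) at time t,
    occurrence_rates[k] is the constant error occurrence rate in that state,
    duration is the mission length tau that starts at time t.
    """
    return sum(
        p * math.exp(-rate * duration)
        for p, rate in zip(up_probabilities, occurrence_rates)
    )


if __name__ == "__main__":
    # Hypothetical occupancy: half the mass in a state with rate 1.0,
    # half in the error-free state with rate 0.0.
    print(expected_reliability([0.5, 0.5], [1.0, 0.0], 1.0))
```

Any probability mass in a "down" state contributes nothing to the sum, which is
consistent with the system being inoperable there.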

Up to this point we have discussed in detail only the basic model (Model I).
The basic model is characterized by the assumptions that $\lambda$ and $\mu$ depend only on $k$,
that the error correction is always successful, and that there are only two states.
Trivedi and Shooman also present a number of variations of the basic model in [131]
and [132]. All these variations have the same assumptions as the basic model
except the three just mentioned, which are altered in some form or other. These
variations are named Model II, Model I-G, Model II-G, Model I-H, and Model II-H.
They all provide some increase in generality and thus flexibility over the basic
model. They will now be summarized.

Model  II 

Model  II  has  the  following  assumptions: 

(1)  The error occurrence rate $\lambda$ and the error repair rate $\mu$ are both
explicit functions of $t$,

i.e.  $\lambda = \lambda(t)$  (Def. 15)

$\mu = \mu(t)$  (Def. 16)

(2)  The  attempted  correction  of  code  is  always  successful. 

(3)  There  are  two  types  of  states  "up"  and  "down". 

A diagram  of  Model  II  is  given  in  Figure  15. 

Model  I-G 

Model  I-G  has  the  following  assumptions: 

(1)  $\lambda = \lambda_{n-k} = \lambda(k)$ ;  $\mu = \mu_{m-k} = \mu(k)$  (Def. 17)

(2)  One of three things can happen when a correction is attempted:

(i)  the bug can be successfully repaired,

(ii)  the attempted correction of code is unsuccessful so that the
total number of bugs remaining is unchanged, even though the
system enters an "up" state. The probability of this event
occurring (i.e. of making the transition from state $(m-k)$ to state $(n-k)$)
in the interval of time $(t, t + \Delta t)$ is $\alpha_{m-k}\Delta t$,

(iii)  the attempted correction of code may be unsuccessful and actually
introduce one new error. The probability of this event occurring
(i.e. of making the transition from state $(m-k)$ to state $(n-k+1)$) in the
interval of time $(t, t + \Delta t)$ is $\beta_{m-k}\Delta t$.


(3)  There  are  two  types  of  states:  "up"  and  "down". 

A diagram  of  Model  I-G  is  given  in  Figure  16. 

Model  II-G 

This model is identical to Model I-G except that the error detection and repair
rates are explicit functions of time.

Model  I-H 

Model  I-H  has  the  following  assumptions: 

(1)  $\lambda = \lambda_{n-k} = \lambda(k)$ ;  $\mu = \mu_{m-k} = \mu(k)$.  (Def. 18)

(2)  The attempted error correction is always successful.

(3)  There are three types of states: "up", "down", and "non critically
down" (degraded) $(\ell, \ell-1, \ell-2, \ldots)$. The probability of a non critical
failure is $\gamma_{n-k}\Delta t$. The probability that a correction is only partially
successful is $\eta_{m-k}\Delta t$. The probability that the correction of a
non-critical failure induces a critical failure is $\omega_{\ell-k}\Delta t$.

A diagram of Model I-H is given in Figure 17.

FIG. 17 - MARKOV MODEL I-H


Model  II-H 


This  model  is  identical  to  Model  I-H,  except  that  the  error  detection  rate  and 
the  error  correction  rate  are  explicit  functions  of  t. 

Conclusions 

This  model  has  a number  of  strong  points: 

(1)  It  provides  estimates  of  reliability  and  availability. 

(2)  It  is  compatible  with  many  of  the  assumptions  of  the  other 
analytical  models. 

(3)  It  mathematically  describes  the  error  behavior. 

(4) It realistically describes the entire debugging process when the detection rate ($\lambda$) and the correction rate ($\mu$) are known.

(5)  It  may  be  able  to  handle  a multiple  state  condition  for  an 
operational  software  system  if  the  transition  probabilities 
are  known. 

Unfortunately, the model has one key flaw. It presents no method for determining the failure rate ($\lambda$) of the software package. In fact, it even requires the error detection rate (as well as the error correction rate) as an input. Our effort is to find a method for accurately determining the failure rate of the software. Therefore, we do not feel the Markov model is applicable to our current analysis.

In the future it may be possible to use this model in conjunction with other analytical models which provide $\lambda$ and $\mu$, so that reliability and availability can be determined.

1.6.12  Basic Schick-Wolverton Model

The Schick-Wolverton model, developed by George Schick and Ray Wolverton, modifies the Jelinski-Moranda de-eutrophication model by changing the assumption about the behavior of errors. We first list and discuss the model assumptions, then give the equations for parameter estimation, and finally apply the model to software error data.

Below  is  a listing  of  the  assumptions  Schick  and  Wolverton  make  regarding 
their  model. 

(1)  There  is  a fixed  number  of  errors  in  the  program. 

(2)  No  new  errors  are  added  during  debugging. 

(3)  The  amount  of  debugging  time  between  error  occurrences  has  a 
Rayleigh  distribution. 

(4)  The  error  rate  is  proportional  to  the  number  of  errors  remaining  and 
the  time  spent  in  debugging. 

(5)  Each  error  discovered  is  immediately  removed. 

As  mentioned  above,  the  basic  Schick-Wolverton  model  assumes  that  the  error 
detection  rate  is  proportional  to  both  the  number  of  errors  remaining  in  the 
software  and  the  time  spent  in  debugging  since  the  detection  of  the  most  recent 
error  (Figure  18).  This  implies  that  between  errors,  the  failure  rate  actually 
increases  with  time,  using  the  rationale  that  the  program's  inputs  gradually  close 
in  on  the  remaining  errors  [86].  Under  this  assumption,  the  time  between  error 
occurrences has a Rayleigh distribution. However, we do not consider this to be a valid assumption, except perhaps in the early stages of debugging (Figure 1). Beyond Phase I, however, we do not believe that the failure rate would be linearly increasing with time. Furthermore, the Schick-Wolverton model indicates that the hazard rate is zero just after the detection of an error, which is also questionable.

The hazard function, given below, is identical to that given in the Jelinski-Moranda model, except that $t_i$ (the time between error occurrences) is included in the Schick-Wolverton model.

$Z(t_i) = \phi [N - (i-1)] t_i$ (Eq. 1)

where $t_i$ = the $i$th time interval, between detection of the $(i-1)$th and $i$th errors. (Def. 1)

$N$ = the number of initial errors present in the system. (Def. 2)

$i$ = the number of corrected errors (equals the number of detected errors). (Def. 3)

$\phi$ = a proportionality constant. (Def. 4)

Fig. 18  Failure Rate Proportional to the Number of Remaining Errors and the Time Since the Detection of the Last Error (vertical axis: error rate)

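In [138], Schick and Wolverton state the maximum likelihood estimates for $N$ and $\phi$, the model parameters, as being:

$\hat{N} = \left[ \frac{2n}{\hat{\phi}} + \sum_{i=1}^{n} (i-1) t_i^2 \right] \Big/ \sum_{i=1}^{n} t_i^2$ (Eq. 2)

$\hat{\phi} = \left[ \sum_{i=1}^{n} \frac{2}{\hat{N} - (i-1)} \right] \Big/ \sum_{i=1}^{n} t_i^2$ (Eq. 3)

where $n$ = the total number of intervals. (Def. 5)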

Then, the mean time to failure (MTTF) and reliability estimates may be derived. In [111], Shooman gives the general equations used in calculating MTTF and $R(t_i)$ for a model with a linearly increasing hazard rate. Thus, for the Schick-Wolverton model, these estimates may be obtained through equations 4 and 5.

$\mathrm{MTTF} = \sqrt{ \frac{\pi}{2 \phi [N - (i-1)]} }$ (Eq. 4)

$R(t_i) = \exp\{ -\phi [N - (i-1)] t_i^2 / 2 \}$ (Eq. 5)

This particular model, as with many of the other software models, requires as inputs the sequence of times between errors (i.e. $t_1, t_2, t_3, \ldots, t_n$).


Software  error  data  has  been  and  continues  to  be  available  only  on  a limited 
basis.  However,  this  model  was  applied  to  two  sets  of  data  which  are  discussed 
below. 

The  first  data  set  was  extracted  from  a report  published  by  Honeywell  [27]. 
This  small  data  set  was  in  the  required  form  of  times  between  error  occurrences. 

The following figure describes this set of error data.

Flight Number    Time Between Errors (t_i)    Cumulative Time
                 4.0                          4.0
                 2.2                          6.2
                 10.1                         16.3
                 12.6                         28.9
                 42.2                         71.1

Fig. 19  Honeywell Flight Test Data


Knowing the $t_i$'s, it was possible to solve equations 2 and 3 for $N$ and $\phi$ respectively. Doing so yields

$\hat{N} = 4.0620532$

$\hat{\phi} = 0.017566$
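As a rough check on these values, equations 2 and 3 can be solved numerically. The sketch below is not the procedure used in [138] or in this study; it assumes that equation 2 is rearranged to give $\phi$ as a function of $N$, that the result is substituted into equation 3, and that the remaining equation in $N$ is solved by bisection. The function name `sw_mle`, the bracketing strategy and the iteration count are illustrative choices.

```python
def sw_mle(times, iterations=200):
    """Solve Eq. 2 and Eq. 3 of the basic Schick-Wolverton model for (N, phi)."""
    n = len(times)
    sq = [t * t for t in times]                     # t_i^2
    s2 = sum(sq)                                    # sum of t_i^2
    s2w = sum(j * x for j, x in enumerate(sq))      # sum of (i-1) * t_i^2

    def phi_of(N):
        # Eq. 2 rearranged: phi = 2n / sum([N - (i-1)] * t_i^2)
        return 2.0 * n / (N * s2 - s2w)

    def g(N):
        # Eq. 3 written as g(N) = 0 after substituting phi_of(N)
        return sum(2.0 / (N - j) for j in range(n)) - phi_of(N) * s2

    lo = (n - 1) + 1e-9          # N must exceed n-1 so that every term is finite
    hi = lo + 1.0
    while g(hi) > 0.0:           # widen the bracket until g changes sign
        hi = lo + 2.0 * (hi - lo)
        if hi > 1e9:
            raise ValueError("no finite maximum likelihood estimate found")
    for _ in range(iterations):  # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    N = 0.5 * (lo + hi)
    return N, phi_of(N)


# Times between errors from Figure 19 (Honeywell flight test data).
honeywell = [4.0, 2.2, 10.1, 12.6, 42.2]
N_hat, phi_hat = sw_mle(honeywell)
print(N_hat, phi_hat)
```

For the Figure 19 data this search should land close to the estimates quoted above. Bisection is used only because it needs no derivative and cannot leave the bracket; any bracketed root finder would serve.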

Now it becomes possible to determine how accurate this model is in predicting MTTF. As a sample calculation, suppose we select the point i=3, which corresponds to flight number 8. Substituting the previously calculated values for $\hat{N}$ and $\hat{\phi}$ into the hazard function (equation 1) and knowing $t_3 = 10.1$ gives:

$Z(t_3) = 0.017566 [4.0620532 - (3-1)] (10.1)$

$Z(t_3) = 0.36584$

To measure the accuracy of this model we apply equation 4 as follows:

$\mathrm{MTTF} = \sqrt{ \frac{3.14159}{2 (0.017566) [4.0620532 - (3-1)]} }$

$\mathrm{MTTF} = 6.5853$
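The sample calculation can be repeated for any point i with a few lines of code. The sketch below simply evaluates equations 1, 4 and 5 of the basic Schick-Wolverton model; the function names are illustrative and do not come from the references.

```python
import math


def sw_hazard(phi, N, i, t):
    """Eq. 1: hazard rate at time t after the (i-1)th error."""
    return phi * (N - (i - 1)) * t


def sw_mttf(phi, N, i):
    """Eq. 4: mean time to the ith failure."""
    return math.sqrt(math.pi / (2.0 * phi * (N - (i - 1))))


def sw_reliability(phi, N, i, t):
    """Eq. 5: probability of no failure within time t after the (i-1)th error."""
    return math.exp(-phi * (N - (i - 1)) * t * t / 2.0)


# Estimates quoted in the text for the Honeywell data, evaluated at i = 3.
phi_hat, N_hat = 0.017566, 4.0620532
z3 = sw_hazard(phi_hat, N_hat, 3, 10.1)
mttf3 = sw_mttf(phi_hat, N_hat, 3)
r3 = sw_reliability(phi_hat, N_hat, 3, 10.1)
print(z3, mttf3, r3)
```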

This same procedure may be applied for any point i (i = 1, 2, 3, 4, 5 in this case). The predicted values are then compared with the MTTF's in Figure 20.

The basic Schick-Wolverton model was compared against other models requiring the same inputs ($t_i$). The basic Schick-Wolverton does not give as good a prediction as the basic Jelinski-Moranda, yet it has the same disadvantage as the basic Jelinski-Moranda: it does not allow for errors introduced during debugging. The Geometric model does allow for this, and the basic Schick-Wolverton does not predict any better than the Geometric model. In fact, the mean percent error is slightly less for the Geometric. In addition, the basic Schick-Wolverton never gives the best estimate. Therefore, both the basic Jelinski-Moranda model and the Geometric model are superior to the basic Schick-Wolverton model as applied to this data set.

Fig. 20  S-W, J-M and Geometric-Poisson Model Results (Honeywell Flight Test Data)

Dr. Paul Moranda, in [80], also applied this model to data from the "F-11D program". Not only was this data set larger than the Honeywell data, but it was also recorded in CPU time, which is much more desirable than calendar time. Moranda states "there is usually a startup effect which is evident". With this particular data set, erratic error behavior exists until the sixth day of testing. Only then does a decreasing failure rate become evident. Also, at this point, there are 67 errors remaining in the software. As most of the details are given in [80], it should suffice to state the following:

$\hat{N} = 41.5$

$\hat{\phi} = 0.2192$

Considering there are 67 known errors remaining, $\hat{N} = 41.5$ indicates a 38% "error" in the prediction process. Figure 21 below compares the results of three models as applied to the "F-11D program".


Model                     Est. # of Rem. Errors    Parameter Estimates                  "Error"
                          (Actual = 67)
Basic Schick-Wolverton    41.5                     $\phi$ = 0.2192, $N$ = 41.5          38.1%
Basic Jelinski-Moranda    63.4                     $\phi$ = 0.035, $N$ = 63.4           5.4%
Geometric-Poisson         62.73                    $K$ = 0.6756, $\lambda$ = 20.348     6.4%

Fig. 21  Comparison of Three Models Applied to F-11D Data
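The "Error" column of Figure 21 appears to be the relative difference between the 67 known remaining errors and each model's estimate. The sketch below recomputes it under that assumption; the dictionary layout and variable names are illustrative.

```python
# Known number of remaining errors and each model's estimate (Figure 21).
actual_remaining = 67
estimates = {
    "Basic Schick-Wolverton": 41.5,
    "Basic Jelinski-Moranda": 63.4,
    "Geometric-Poisson": 62.73,
}

# Relative "error" in percent, assumed to be |actual - estimate| / actual.
percent_error = {
    model: 100.0 * abs(actual_remaining - est) / actual_remaining
    for model, est in estimates.items()
}

for model, err in percent_error.items():
    print(f"{model}: {err:.1f}%")
```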


Again, as with the Honeywell data, the basic Schick-Wolverton model yields poorer results than the other models. Hence, based on these two applications, we feel that the basic Schick-Wolverton will not yield acceptable results in our particular analysis.

1.6.13  Extended  Schick-Wolverton  Model 

The  extended  Schick-Wolverton  model,  suggested  by  Myron  Lipow  of  the 
TRW  Defense  and  Space  Systems  Group,  is  very  similar  to  the  basic  Schick- 
Wolverton  model.  The  most  notable  change  is  that  this  extension  allows  for 
more  than  one  error  in  a given  debugging  period.  Thus,  the  required  inputs 
are  the  number  of  errors  per  interval  where  each  debugging  interval  is  of 
equal  length  (time).  Otherwise,  the  model  assumptions  remain  the  same  as 
with  the  basic  Schick-Wolverton  model.  The  reader  may  refer  back  to  the 
basic  model  for  such  information. 

The hazard function in this case is of the form:

$Z(t_i) = \phi [N - n_{i-1}] t_i$ (Eq. 1)

where $t_i$ = the $i$th debugging interval (Def. 1)

$\phi$ = a proportionality constant (Def. 2)

$N$ = the total number of initial errors in the system (Def. 3)

$n_{i-1}$ = the cumulative number of errors encountered through the $(i-1)$st time interval (Def. 4)

As can be seen, the hazard rate proposed for this model is nearly identical to that given with the basic Schick-Wolverton model. Here $n_{i-1}$ replaces $i-1$, since errors per interval are the inputs rather than the times between errors.

The model parameters, $N$ and $\phi$, are estimated through the following equations:

$\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} = \sum_{i=1}^{m} \phi \, t_i^2 / 2$ (Eq. 2)

Note: In reference [124], Capt. Sukert omitted the $\phi$ in equation 2.

$\frac{n}{\phi} = \sum_{i=1}^{m} (N - n_{i-1}) \, t_i^2 / 2$ (Eq. 3)

Now, if equation 3 is solved for $\phi$ and substituted into equation 2, we obtain:

$\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} = \frac{n}{N - \left( \sum_{i=1}^{m} n_{i-1} t_i^2 / 2 \right) \Big/ \left( \sum_{i=1}^{m} t_i^2 / 2 \right)}$ (Eq. 4)

where $m$ = the total number of intervals (Def. 5)

$m_i$ = the number of errors found in the $i$th time interval (Def. 6)

$n$ = the total number of errors found to date (Def. 7)


Then, similar to the basic Schick-Wolverton model, MTTF's and $R(t_i)$ may be calculated from the following expressions:

$\mathrm{MTTF} = \sqrt{ \frac{\pi}{2 \phi (N - n_{i-1})} }$ (Eq. 5)

$R(t_i) = \exp( -\phi (N - n_{i-1}) t_i^2 / 2 )$ (Eq. 6)

As previously mentioned, $m_i$, the number of errors per interval (i.e. $m_1, m_2, m_3, \ldots, m_m$), is a required input in order to apply this extended model.

The acquisition of two sets of data in the proper format made it possible to apply the extended Schick-Wolverton model. First, a small set of data, from Dickson et al. [30], was tested. The weaknesses associated with this data are that the set is quite small and that the intervals were in calendar time as opposed to CPU time.

Below, Figure 22, is a listing of the errors per month and the cumulative number of errors to date found in reference [30].

Month    Errors/Month (m_i)    Cumulative Number of Errors Thru (i-1)st Interval (n_{i-1})
1        520                   0
2        430                   520
3        300                   950
4        170                   1250
5        120                   1420
6        60                    1540
7        40                    1600

Fig. 22  Dickson et al. Software Error Data

$n = \sum_{i=1}^{m} m_i = 1640$, $m = 7$

To estimate the model parameters, equation 4 must be solved. Doing so will yield a value for $N$. Using the $m_i$ and $n_{i-1}$ given in Figure 22 produces

$\hat{N} = 1738.6$

Now, applying equation 2 or 3 we find

$\hat{\phi} = 0.6708571$

Note that 1640 errors have been found to date and the model predicts approximately 1739 errors overall. We feel that this is a fairly reasonable estimate.
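Equation 4 has no closed-form solution for $N$, so it must be solved iteratively. The sketch below is one possible way to do so and is not taken from [124]: it assumes unit-length intervals (as in the sample calculations of this section), brackets the root just above the largest $n_{i-1}$, and applies bisection. The function name `extended_sw_mle` and its parameters are illustrative.

```python
def extended_sw_mle(counts, durations=None, iterations=200):
    """Solve Eq. 4 of the extended Schick-Wolverton model for N, then Eq. 3 for phi."""
    m = len(counts)
    if durations is None:
        durations = [1.0] * m                 # assumption: unit-length intervals
    n = sum(counts)                           # total errors found to date
    n_prev, running = [], 0.0
    for c in counts:                          # n_{i-1}: errors found before interval i
        n_prev.append(running)
        running += c
    w = [t * t / 2.0 for t in durations]      # weights t_i^2 / 2
    w_sum = sum(w)
    nw_sum = sum(p * x for p, x in zip(n_prev, w))

    def h(N):
        # Eq. 4 written as h(N) = 0
        lhs = sum(c / (N - p) for c, p in zip(counts, n_prev))
        return lhs - n / (N - nw_sum / w_sum)

    lo = n_prev[-1] + 1e-9                    # N must exceed the largest n_{i-1}
    hi = lo + 1.0
    while h(hi) > 0.0:                        # widen the bracket until h changes sign
        hi = lo + 2.0 * (hi - lo)
        if hi > 1e12:
            raise ValueError("no finite estimate of N found")
    for _ in range(iterations):               # plain bisection
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    N = 0.5 * (lo + hi)
    phi = n / (N * w_sum - nw_sum)            # Eq. 3 solved for phi
    return N, phi


# Errors per month from Figure 22 (Dickson et al. data).
dickson = [520, 430, 300, 170, 120, 60, 40]
N_hat, phi_hat = extended_sw_mle(dickson)
print(N_hat, phi_hat)
```

With the Figure 22 counts this should return values close to the estimates quoted above; small differences in the last digits are to be expected from rounding.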

However, one notable shortcoming of this model becomes evident when the hazard function is applied. The sample calculation below, Figure 23, and Figure 24 illustrate this deficiency.

Suppose we wish to estimate the number of errors in interval (month) 4, which has been selected arbitrarily. Substituting the known values into equation 1 (the hazard function) will give

$Z(t_4) = 0.6708571 (1738.6 - 1250)(1)$

$Z(t_4) = 327.78$

Fig. 23  Comparison of the Extended J-M, Geometric-Poisson and Extended S-W Models (Dickson et al. Data)

[Figure: Comparison of the Extended J-M, Geometric-Poisson and Extended S-W Models (Dickson et al. data); curves: actual data, Geometric-Poisson, extended J-M, extended S-W]

Comparing the predicted error count (327.78) with the actual value (170) shows the inaccuracy of the monthly predictions. The 93% variation is extremely large; Figure 23 gives the remaining predictions and the respective "error" (variation).

One puzzling point is that the model predicted 1738.6 total errors in the system; however, summing the monthly estimates gives a total of 3280.63. This alone causes us to suspect the adequacy of the model.
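The monthly estimates and their sum can be reproduced directly from equation 1. The sketch below also compares the sum with $2n$. With unit intervals ($t_i = 1$), equation 3 rearranges to $\sum \phi (N - n_{i-1}) t_i = 2n$, so a total near 3280 is what equations 1 and 3 jointly imply. This is an algebraic observation about the equations as written here, not an interpretation given in the cited references, and the variable names are illustrative.

```python
# Estimates quoted in the text for the Dickson et al. data (Figure 22).
phi_hat, N_hat = 0.6708571, 1738.6
m_counts = [520, 430, 300, 170, 120, 60, 40]       # errors per month, m_i
n_prev = [0, 520, 950, 1250, 1420, 1540, 1600]     # cumulative errors, n_{i-1}

# Eq. 1 evaluated at the end of each unit-length interval (t_i = 1).
monthly = [phi_hat * (N_hat - p) * 1.0 for p in n_prev]
total = sum(monthly)
n_found = sum(m_counts)

print([round(z, 2) for z in monthly])
print(round(total, 2), 2 * n_found)
```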

Both Figure 23 and Figure 24 clearly show that the Geometric-Poisson and the extended Jelinski-Moranda models are much more accurate than the extended Schick-Wolverton in the monthly prediction process.

The second set of data to which this model was applied was obtained from reference [134]. In this case, errors were given in CPU time (seconds), and 10-second intervals were used for $t_i$, which required interpolation to find the number of errors per interval. Figure 25 below lists the number of errors per interval as well as the cumulative number of errors discovered.


Interval      Errors/Interval (m_i)    Cumulative Number of Errors Thru (i-1)st Interval (n_{i-1})
0-10 sec.     19.53                    0
10-20 sec.    12.83                    19.53
20-30 sec.    10.93                    32.36
30-40 sec.    7.52                     43.29
40-50 sec.    4.58                     50.81
50-60 sec.    1.37                     55.39

Fig. 25  F-11D Error Data (Errors Per Interval)

$n = \sum_{i=1}^{m} m_i = 56.76$, $m = 6$


Moranda, in [80], mentions that the errors behave in an erratic manner early (up through 1/18) and attributes this to the startup effect of the program. Thus, the data used in the model application begins with day 1/19. The above table takes this into account.

Now, once again, estimates for $N$ and $\phi$ must be obtained. Utilizing the same procedure as was used with the Dickson et al. data, we find

$\hat{N} = 62.2$

and

$\hat{\phi} = 0.6607$

To date 56.76 errors have been found and the extended Schick-Wolverton model predicted 62.2. Again, this estimate appears to be fairly accurate.

Figures 26 and 27 were derived upon application of the hazard function for this model. For example, if we want to find the estimated number of errors in the second interval, we find (by equation 1)

$Z(t_2) = 0.6607 (62.2 - 19.53)(1)$

$Z(t_2) = 28.19$

Figure 26 indicates that this estimate is very inaccurate when compared with 12.83 (the actual value). As previously mentioned, the extended Schick-Wolverton predicted 62.2 errors (after day 1/18); however, summing the six intervals gives 113.53 errors. We have no explanation as to why this inconsistency occurs. Also, as with the other data set (Dickson et al.), the extended Schick-Wolverton model was inferior to both the Geometric-Poisson and the extended Jelinski-Moranda models. Our conclusion is that this particular model does not yield accurate results when predicting the number of errors per interval. Thus, this model has been eliminated from further consideration in our software analysis study.

1.6.14  Modified  Schick-Wolverton  Model 

The modified Schick-Wolverton model, a slight variation of the extended Schick-Wolverton model, was suggested by Lipow and implemented by Sukert [123,124].

Fig. 27  Comparison of the Extended J-M, Geometric-Poisson and Extended S-W Models with Actual Data (Wagoner's data, F-11D program; each interval = 10 CPU seconds; curves: actual data, Geometric-Poisson, extended J-M, extended S-W)

The assumptions of the modified Schick-Wolverton model are listed and discussed below.

(1)  There  is  a fixed  number  of  errors  in  the  program. 

(2)  No  new  errors  are  added  during  debugging. 

(3) The error rate is constant during a time interval and is proportional to the number of errors remaining following the (i-1)st time interval and the total debugging time previously spent, including an "averaged" error search time during the current time interval.

(4)  Each  error  discovered  is  immediately  removed. 

The major difference with the modified version is assumption 3. It is assumed that the error discovery rate is constant during a time interval and is proportional to the number of errors remaining following the (i-1)st time interval and the total time previously spent in testing, including an "averaged" error search time during the current time interval $t_i$ (Figure 28). We strongly question this assumption, which implies that the error detection rate increases with the time spent in searching for the $i$th error.

Furthermore, we question the relationship which indicates that the failure rate increases as a function of total time spent in debugging. As can be seen in Figure 28, this could result in an increasing failure rate when the number of detected failures is relatively low. It should be noted, however, that as the number of detected errors becomes large, the error rate between intervals tends to decrease, and within intervals, the rate of increase of the failure rate with time becomes smaller, approaching a constant rate within the interval.

The hazard function for the modified Schick-Wolverton model is given as:

$Z(t_i) = \phi (N - n_{i-1}) (T_{i-1} + t_i/2)$ (Eq. 1)


Note: The formulation of the modified Schick-Wolverton model is described by Sukert in references [123] and [124]. In reference [123], $\phi$, the proportionality constant, was omitted from the above hazard function. However, reference [124] does include $\phi$ in the hazard function for this model. Thus, it is our conclusion that the deletion of $\phi$ from the hazard function in [123] was merely a typographical error.

[Fig. 28: ... the "average" time since the detection of the last error]

Sukert also stated that the hazard rate between errors in this modified version of the model is constant, which is inconsistent with any plausible interpretation of the hazard function for this model.

In any event, the notation for the hazard function is defined below.

$t_i$ = the $i$th debugging interval (Def. 1)

$\phi$ = a proportionality constant (Def. 2)

$N$ = the total number of initial errors in the system (Def. 3)

$n_{i-1}$ = the cumulative number of errors observed up through the $(i-1)$st interval (Def. 4)

$T_{i-1}$ = the cumulative debugging time through the $(i-1)$st interval (Def. 5)

Again, $N$ and $\phi$ are the model parameters and may be estimated by the following equations:

$\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} = \phi \sum_{i=1}^{m} t_i (T_{i-1} + t_i/2)$ (Eq. 2)

Note: $\phi$ was omitted from the above equation by Capt. Sukert in reference [124].

$\frac{n}{\phi} = \sum_{i=1}^{m} (N - n_{i-1}) \, t_i (T_{i-1} + t_i/2)$ (Eq. 3)

Now, solving equation 3 for $\phi$ and substituting back into equation 2 yields

$\sum_{i=1}^{m} \frac{m_i}{N - n_{i-1}} = \frac{n}{N - \left[ \sum_{i=1}^{m} n_{i-1} t_i (T_{i-1} + t_i/2) \right] \Big/ \left[ \sum_{i=1}^{m} t_i (T_{i-1} + t_i/2) \right]}$ (Eq. 4)

where $m$ = the total number of intervals (Def. 5)

$m_i$ = the number of errors found in the $i$th time interval (Def. 6)

$n$ = the total number of errors found to date (Def. 7)


Next, MTTF and $R(t_i)$ may be computed using the following formulas:

$\mathrm{MTTF} = \int_0^{\infty} R(t_i) \, dt_i$ (Eq. 5)

$R(t_i) = \exp( -\phi (N - n_{i-1}) (T_{i-1} t_i + t_i^2/4) )$ (Eq. 6)

Since the modified Schick-Wolverton model requires the same inputs as the extended Schick-Wolverton model ($m_i$), we will apply this model to the same two data sets.

First, the Dickson et al. data set is given in Figure 22 (under extended Schick-Wolverton). Using equations 2-4 we may calculate $\hat{N}$ and $\hat{\phi}$. The modified Schick-Wolverton model yields the following estimates:

$\hat{N} = 1613.85$

and

$\hat{\phi} = 0.2430$

Immediately, it is evident that $\hat{N}$ (1613.85) is not a very good estimate since 1640 errors have already been found. Again, as with the extended Schick-Wolverton, we find that summing the monthly predictions yields a value in excess of $\hat{N}$ (the total number of predicted errors in the system). Figure 29 and Figure 30 show the results when the modified model is applied to the previously discussed data set.
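The estimates above can be checked numerically by solving equation 4 of the modified model, in which each interval is weighted by $t_i (T_{i-1} + t_i/2)$. The sketch below is one possible approach and is not taken from [123] or [124]: it assumes unit-length intervals, brackets the root just above the largest $n_{i-1}$, and applies bisection. The function name `modified_sw_mle` and its parameters are illustrative.

```python
def modified_sw_mle(counts, durations=None, iterations=200):
    """Solve Eq. 4 of the modified Schick-Wolverton model for N, then Eq. 3 for phi."""
    m = len(counts)
    if durations is None:
        durations = [1.0] * m                      # assumption: unit-length intervals
    n = sum(counts)                                # total errors found to date
    n_prev, T_prev = [], []
    errors, elapsed = 0.0, 0.0
    for c, t in zip(counts, durations):
        n_prev.append(errors)                      # n_{i-1}
        T_prev.append(elapsed)                     # T_{i-1}
        errors += c
        elapsed += t
    # Weights t_i * (T_{i-1} + t_i / 2) from the modified hazard function.
    w = [t * (T + t / 2.0) for t, T in zip(durations, T_prev)]
    w_sum = sum(w)
    nw_sum = sum(p * x for p, x in zip(n_prev, w))

    def h(N):
        # Eq. 4 written as h(N) = 0
        lhs = sum(c / (N - p) for c, p in zip(counts, n_prev))
        return lhs - n / (N - nw_sum / w_sum)

    lo = n_prev[-1] + 1e-9                         # N must exceed the largest n_{i-1}
    hi = lo + 1.0
    while h(hi) > 0.0:                             # widen the bracket until h changes sign
        hi = lo + 2.0 * (hi - lo)
        if hi > 1e12:
            raise ValueError("no finite estimate of N found")
    for _ in range(iterations):                    # plain bisection
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    N = 0.5 * (lo + hi)
    phi = n / (N * w_sum - nw_sum)                 # Eq. 3 solved for phi
    return N, phi


# Errors per month from Figure 22 (Dickson et al. data).
dickson = [520, 430, 300, 170, 120, 60, 40]
N_hat, phi_hat = modified_sw_mle(dickson)
print(N_hat, phi_hat)
```

The lower end of the bracket is the largest $n_{i-1}$ rather than the total $n$, which is why the search is able to return an estimate of $N$ that is smaller than the number of errors already found, as happens here.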

Again, a sample calculation is presented to show the procedure for estimating the number of errors in an interval. For month 7, equation 1 gives

$Z(t_7) = 0.2430 (1613.85 - 1600)(6 + 1/2)$

$Z(t_7) = 21.878$

Figure 29 gives the predicted values for all intervals and compares them with the actual values. Also, the extended Jelinski-Moranda and Geometric-Poisson models' results are compared with those obtained from the modified Schick-Wolverton model. As can be seen, the results of the modified model are not encouraging.

[Figure: Extended J-M, Modified S-W and Geometric-Poisson Model Results (Dickson et al. data)]


The F-11D data, given in reference [134], was also applied and is considered to be good data since CPU seconds was the time scale as opposed to calendar time. Figure 25 (given in section 1.6.13) gives the necessary inputs for this model. The equations for parameter estimation (equations 2-4) were used to obtain

$\hat{N} = 56.0244$

$\hat{\phi} = 0.26593$

It should be noted that 56.76 errors have already been found and this model predicts a total of 56.0244. This estimate is not as good as it initially appears, and Figures 31 and 32 are obtained when we apply the hazard function to all of the intervals. Once again, this model is much less accurate than either the Geometric-Poisson or the extended Jelinski-Moranda models. Our conclusion here, as with the extended Schick-Wolverton model, is that this particular model does not appear to model the error process well; hence we see little reason to pursue it further.

1.6.15  LaPadula  Reliability  Growth  Model 

This  model  was  developed  in  [61]  and  is  an  outgrowth  of  a hardware 
reliability  growth  model.  It  makes  the  following  assumptions: 

(1) A test sequence is conducted in N stages, where a stage terminates whenever a change (of any kind) is made to the program.

(2) The change can be either a correction to or a corruption of the program.

(3) The number of tests per stage of testing is not fixed.

(4) The number of stages in the test sequence is not fixed.

(5) At any particular stage of testing, there exists an upper bound on the reliability of a piece of software, and this upper bound may very well be significantly less than one.

[Fig. 31 and Fig. 32: intervals of 10 CPU seconds each]

The only data recorded is whether the program succeeded or failed in each test during each testing stage. Thus for the $k$th stage there will be $n_k$ tests and $s_k$ successes, for $1 \le k \le N$ and $k$ an integer.

The growth function is given as:

$R(k) = R(u) - A/k$

where $R(k)$ is the actual reliability during the $k$th stage, $R(u)$ is the ultimate value of reliability as $k$ goes to infinity, and $A$ is a shape parameter for the growth rate. $R(u)$ and $A$ can be estimated by the method of least squares or by maximum likelihood. The latter proves to be the more precise. Then $R(k)$ can be determined for any value of $k$ desired. These values of $R(k)$ will demonstrate the growth of the reliability as it approaches $R(u)$.
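As an illustration of the least squares route, the growth function is linear in $1/k$, so $R(u)$ and $A$ follow from an ordinary straight-line fit. The sketch below assumes an unweighted fit to per-stage reliabilities (in practice the observed ratios $s_k / n_k$); the exact estimator used in [61] is not reproduced here. The function name `fit_growth` is illustrative, and the input values are synthetic, generated from known parameters purely to confirm that the fit recovers them.

```python
def fit_growth(stage_reliabilities):
    """Unweighted least squares fit of R(k) = R_u - A / k for stages k = 1, 2, ..."""
    y = list(stage_reliabilities)
    x = [1.0 / k for k in range(1, len(y) + 1)]    # regressor 1/k
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx                              # slope of R against 1/k equals -A
    R_u = y_mean - slope * x_mean                  # intercept: reliability as k -> infinity
    A = -slope
    return R_u, A


# Synthetic, noise-free stage reliabilities generated from R_u = 0.9 and A = 0.4.
synthetic = [0.9 - 0.4 / k for k in range(1, 9)]
R_u_hat, A_hat = fit_growth(synthetic)
print(R_u_hat, A_hat)
```

With real data the stages have different numbers of tests, so a fit weighted by $n_k$, or the maximum likelihood approach mentioned above, may be more appropriate than this unweighted version.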

There appear to be a number of problems associated with this model which prohibit its use for our purposes. First of all, no continuous time scale is utilized in the data collection; only the discrete stage numbers are used. This results in the reliability, $R(k)$, being a function of these stage numbers, and the stages themselves can be of any arbitrary length. Ordinarily, reliability is a function of time, and it is not clear how $R(k)$ can be related to any form of time-dependent failure rate (e.g. MTTF). Thus the usefulness of $R(k)$ is questionable.

Another  difficulty  is  that  the  model  is  not  easily  validated  from  the 
actual  data  from  which  its  results  are  obtained  or  from  future  operational  data. 
The  data  is  in  the  form  of  the  number  of  successes  and  the  number  of  tests  in 
each  stage.  The  model  produces  a reliability  level  per  stage.  It  is  not 
apparent  how  these  two  forms  can  be  compared  for  validation  purposes.  LaPadula 
demonstrates  how  R(k)  asymptotically  approaches  R(u),  but  since  R(k)  is  obtained 
from  R(u),  this  is  in  effect  comparing  the  model  with  itself  [61]. 




In most software packages there are a number of failures at the beginning of testing. If $R(u)$ is estimated from this data it may be unreasonably low. The model does not indicate how long data should be collected to lessen the impact of these early failures and to arrive at a "reasonable" $R(u)$. Even if such a stopping rule existed, least squares and maximum likelihood techniques for estimation are hardly infallible, and little confidence could be placed in $R(u)$, especially if the software is a part of some critical, real time control system (e.g. air traffic control). Associated with this problem is the fact that as $k \to \infty$ there will be an infinite number of changes to the software, and it may have changed so drastically that it may no longer be the same software.

Because  of  the  above  problems  we  do  not  feel  that  this  model  will  be 
useful  in  our  effort. 

1.7  APPLICATION  OF  PROMISING  MODELS  TO  DATA 

After reviewing the assumptions and formulations of the models described in Sections 1.5 and 1.6 we wished to make a more extensive comparison of the best models when applied to a number of data sets. The available literature contains or describes some 25 data sets. Most of these, however, are not completely suitable. Many are quite small, with five to eight data points; others are only presented graphically, and not tabulated. After a review and comparison of the available data sets we selected four for further study. These were Sukert's data set one [124], Musa's example one [83], Baker's report of IBM development experience to Rome Air Development Center (RADC) [8], and a set of data gathered from Damman's report on the flight test of a digital flight control system [27].

Four of the models were selected as being "best". They were the Jelinski-Moranda exponential model, the extended Jelinski-Moranda exponential model, the Geometric de-eutrophication model and the Schneidewind method 3. This group of four models was considered broad enough to meet the limitations of most data sets. Both the Jelinski-Moranda model and the Geometric model require data collected about the sequence of times between errors, while the extended Jelinski-Moranda and Schneidewind models require data giving the number of errors in some uniform time period (e.g. errors per day).

The Jelinski-Moranda and extended Jelinski-Moranda are typical of a family of exponential models which also includes the Shooman, Miyamoto and Musa models. They are distinguished by a hazard/correction rate proportional to the number of errors remaining in the program and the assumption of a fixed initial number of errors in the program with no new errors being introduced during debugging. The Geometric model and Schneidewind's method 3 are representative of another family of models which assume a hazard/correction rate proportional to the rate of the previous interval or error. This allows for new errors introduced during the debugging process, so long as the number of new errors created is proportional to the number of old errors corrected. Other models in this family are Schneidewind's methods 1 and 2 and the Geometric-Poisson.

In  the  absence  of  data  on  the  long  term  performance  of  specific  software, 
much  of  the  data  provided  by  a reliability  model  can  be  of  little  use.  In  this 
circumstance  such  information  as  total  initial  errors,  number  of  errors  remain- 
ing and  the  like  is  only  of  academic  interest.  After  all,  when  the  only  data 
available  is  time  between  errors,  all  that  can  be  validated  from  a model  is 
predicted  time  between  errors.  Additionally,  the  only  values  available  from 
all  the  models  are  either  predicted  and  estimated  past  time  between  errors,  or 
predicted and estimated past errors per period. While many of the exponential
family models provide an estimate of total initial errors, most other models
do not. Several papers have presented results using cumulative errors detected.
The use of either this or some calculation of errors remaining introduces an
undesirable smoothing effect, due to the reduced relative error. Specifically,
the impact of this smoothing effect can be seen in Figures 33 and 34. Figure 33
is a plot of cumulative errors found from Sukert's data (described below) and
the Geometric-Poisson model fitted to the last 18 weeks. Figure 34 plots
the error data on a week by week basis. The use of errors remaining in the
evaluation of model performance introduces another problem as well. This is
the difficulty of estimating or determining what the actual initial number of
errors in the software was. Several authors have assumed that all errors in
the software under study had been detected at the time the study began. For
data like that of Figure 35 such an assumption is clearly unwarranted. It
should be noted that at least one author has even declined to count those
errors discovered during his study.

[Figures 33 and 34: Sukert's Data Set; horizontal axis: time in weeks]
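
The smoothing effect can be made concrete with a short calculation. The
sketch below uses the monthly error counts and the extended Jelinski-Moranda
predictions tabulated for Baker's data in Figure 40 (Section 1.7.2), and
compares the relative deviation computed month by month with the same
quantity computed on cumulative totals. The function name
relative_deviations is ours and is used for illustration only.

```python
from itertools import accumulate

# Monthly values from Figure 40 (Baker's data [8]), March through February.
actual = [690, 825, 650, 765, 470, 465, 345, 355, 340, 200, 190, 130]
predicted = [884, 791, 681, 594, 491, 428, 366, 319, 272, 226, 199, 174]  # extended J-M


def relative_deviations(observed, model):
    """Return (model - observed) / observed for each pair of values."""
    return [(m - o) / o for o, m in zip(observed, model)]


per_month = relative_deviations(actual, predicted)
cumulative = relative_deviations(list(accumulate(actual)), list(accumulate(predicted)))

for month, (p, c) in enumerate(zip(per_month, cumulative), start=1):
    print(f"month {month:2d}: per-month {p:+7.1%}   cumulative {c:+7.1%}")
```

For the final month the per-month deviation is roughly 34 percent, while the
cumulative totals of data and model agree exactly at that point. A plot of
cumulative errors would therefore suggest a far better fit than the
month-by-month values support.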

1.7.1  Sukert's  Data 

A.N. Sukert reports in [124] and summarizes in [123] an investigation
of several software reliability models. He compared the performance of the
extended Jelinski-Moranda, Schick-Wolverton, extended Schick-Wolverton, and
Geometric-Poisson models. His data is a sequence of 2191 errors grouped by
day of occurrence over a 165-day period. He describes the data as being
from an Air Force command and control system, and it is reproduced in Figure 35.
Throughout his study Sukert evaluates the models based on their ability to
predict the number of remaining errors. As described above, the uncertainty
in determining what the actual number of remaining errors really is, together
with the fact that models of the geometric family have no upper bound on the
number of remaining errors, argues against that approach for this study.

We feel that a major requirement of a study of this type is the use of
all available data, exactly as given. Many of the papers presenting models
use published data sets, but trim off or "bank" errors from the beginning and
end of testing with little or no explanation or justification. While this can
certainly improve a model's performance, it should require very good
justification. This particular data set shows a strong seven day cycle, due perhaps
to reduced testing on weekends. In order to get a more consistent time scale
with respect to debugging effort and error exposure, we chose to group the
data into 23 weekly intervals. This data, as tabulated in Figure 36 and plotted
in Figure 37, was analyzed with the extended Jelinski-Moranda exponential model
and Schneidewind's method 3.
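
The regrouping of daily counts into weekly intervals can be sketched as
follows. This is only an illustration: the counts used are placeholders, not
Sukert's data, and the text does not say how the days left over after the
last full week were treated, so the sketch simply returns them separately.
The function name group_into_weeks is hypothetical.

```python
def group_into_weeks(daily_counts, days_per_week=7):
    """Sum daily error counts into consecutive full weeks.

    Returns (weekly_totals, leftover_days), where leftover_days holds the
    counts of any trailing days that do not fill a complete week.
    """
    full_weeks = len(daily_counts) // days_per_week
    weekly_totals = [
        sum(daily_counts[w * days_per_week:(w + 1) * days_per_week])
        for w in range(full_weeks)
    ]
    leftover_days = list(daily_counts[full_weeks * days_per_week:])
    return weekly_totals, leftover_days


# Placeholder counts only (NOT Sukert's data): 165 days with one error each.
weekly, leftover = group_into_weeks([1] * 165)
print(len(weekly), len(leftover))  # 23 full weeks, 4 leftover days
```

A 165-day record yields 23 complete weeks, which agrees with the number of
intervals used here.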

The results using the two models are also listed in Figure 36. As can
be seen, these two models give very similar results, differing by nine errors
per week at most. They appear to give an equally good fit. The extended
Jelinski-Moranda is less smooth due to its use of actual errors in its hazard
function. Figure 38 plots the relationship between the extended
Jelinski-Moranda model and the data. Two differences between the models did become
apparent when we began solving them for smaller subsets of the data. As seen in
Figure 39, the extended Jelinski-Moranda model is somewhat inconsistent. Fairly
small changes in the data give a significant change in the model's shape and
prediction. Also, the effect of the assumption of a finite number of errors is
quite apparent, since the extended Jelinski-Moranda model reports no errors left
five weeks before the end of testing.

1.7.2  IBM  Data 

W.F. Baker of IBM, in a report to Rome Air Development Center, describes
and analyzes software error reports from a one-year period during the
development of a large real-time multi-processor data processing system. The graph
on page 29 of reference [8] is a plot of errors detected per month. Below,
Figure 40 gives the values (approximate) read off that plot and also lists the
results of applying the extended Jelinski-Moranda model and Schneidewind's
method 3.

[Figure: Sukert's Data Set]

[Figure: Extended J-M Applied to Sukert's Data (Weekly); horizontal axis:
weeks of testing]

                         Baker's Data [8]

Month        Errors Found    Extended J-M    Schneidewind (Method 3)

March            690             884               891
April            825             791               770
May              650             681               665
June             765             594               575
July             470             491               497
August           465             428               430
September        345             366               371
October          355             319               321
November         340             272               277
December         200             226               239
January          190             199               207
February         130             174

Fig. 40  Extended J-M and Schneidewind (Method 3)

The parameters of these models were:


Extended Jelinski-Moranda exponential model:

    $N = 6592.86$

    $\phi = 0.13405213$

Schneidewind method 3:

    $\alpha = 957.629$

    $\beta = 0.1458545$

    $s = 4$
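
The Extended J-M column of Figure 40 can be recovered from the two extended
Jelinski-Moranda parameters above if the prediction for each month is taken
to be the proportionality constant (0.13405213) times the number of errors
estimated to remain at the start of that month, that is, N minus the errors
actually found in all earlier months. That form is our reading of the
tabulated values and of the remark in Section 1.7.1 that the model uses
actual errors in its hazard function; it is an assumption, not a formula
quoted from the text. The function name is hypothetical.

```python
N = 6592.86          # estimated total initial errors (from the parameter list)
PHI = 0.13405213     # proportionality constant (from the parameter list)

# Errors found per month, March through February (Figure 40).
actual = [690, 825, 650, 765, 470, 465, 345, 355, 340, 200, 190, 130]


def extended_jm_predictions(n_total, phi, observed):
    """Predicted errors per period: phi * (errors estimated to remain).

    Assumed form: the remaining-error count is n_total minus the errors
    actually observed in all earlier periods.
    """
    predictions = []
    found_so_far = 0
    for count in observed:
        predictions.append(phi * (n_total - found_so_far))
        found_so_far += count
    return predictions


predicted = [round(p) for p in extended_jm_predictions(N, PHI, actual)]
print(predicted)
```

Under this assumption the rounded predictions agree with the Extended J-M
column of Figure 40 for all twelve months.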

Again, the results from the models were very similar. Figure 42
shows the extended Jelinski-Moranda model with the original data. It provides
the best fit when the models' performances are compared to the actual data
using a chi-square test. An interesting aspect of this data is the long-range
prediction both models give. The expected number of errors found per year can
be tabulated as follows:

Year                 Errors found during year

1 (given data)             5425
2                           964
3                           169
4                            29
5                             6

Fig. 41  Long Range Predictions

It is clear that an improvement in time between failures from 12 days, after 4
years of debugging, to 2 months, at the cost of a fifth year of testing, would
be hard to justify. Figure 43 plots percent deviation between the model
predictions and the actual data for the Schneidewind method 3, extended
Jelinski-Moranda and, additionally, the Geometric-Poisson. This reveals that the actual
performance of the models is very close, at least for this data. This is indeed
true in many cases. The Schneidewind method 3 gives a slightly better fit than
the Geometric-Poisson or the equivalent Schneidewind method 1. The two model
families give curves that are very similar. The extended Jelinski-Moranda model
gives an even better fit than the Schneidewind method 3, but has two
disadvantages. The first is a sensitivity to erratic data as seen above. The
second disadvantage, an outgrowth of the assumption that no new errors are
introduced during debugging, is best described as optimism. In many cases the
extended Jelinski-Moranda gives a much lower prediction of future bugs.

1.7.3  Honeywell  Data 

Damman et al. in reference [27] describe the flight test of a multi-mode
digital flight control system. Figure 44 summarizes the software errors
encountered during the flight test, together with cumulative flying time and
flying time between errors. Because the data is expressed as time between
errors, the basic Jelinski-Moranda exponential and Geometric models were applied.
The parameters as derived by maximum likelihood were:

Jelinski-Moranda exponential:

    $\phi = 0.06083073$

    $N = 4.37686774$

Geometric model:

    $K = 0.54044111$

    $D = 0.36422214$

The  results  of  the  two  models  are  plotted  in  Figure  45  with  the  actual 
data.  As  described  above,  the  inherent  "optimism"  of  the  exponential  family  is 
apparent.  In  fact  the  hazard  function  cannot  even  be  calculated  for  the  sixth 
error  (prediction).  The  term  (N-n)  is  negative,  giving  meaningless  results. 
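
The breakdown at the sixth error can be checked directly from the parameters
above. The sketch below evaluates the Jelinski-Moranda hazard, proportional
to N - (i - 1), and a Geometric hazard of the assumed form D * K**(i - 1),
and prints the implied mean time to the next error (taken as the reciprocal
of the hazard, which assumes exponentially distributed times). The Geometric
form is our assumption; it is consistent with the listed parameters in that
n divided by the sum of K**(i - 1) * t_i returns the quoted value of D.

```python
PHI, N = 0.06083073, 4.37686774    # Jelinski-Moranda parameters from the text
K, D = 0.54044111, 0.36422214      # Geometric model parameters from the text

observed = [4.0, 2.2, 10.1, 12.6, 42.2]   # flight hours between errors (Figure 44)


def jm_hazard(i):
    """Jelinski-Moranda hazard before the i-th error: phi * (N - (i - 1))."""
    return PHI * (N - (i - 1))


def geometric_hazard(i):
    """Assumed Geometric hazard before the i-th error: D * K**(i - 1)."""
    return D * K ** (i - 1)


for i in range(1, 7):
    z_jm, z_geo = jm_hazard(i), geometric_hazard(i)
    jm_time = f"{1 / z_jm:6.1f}" if z_jm > 0 else "  n/a "
    seen = f"{observed[i - 1]:5.1f}" if i <= len(observed) else "  -  "
    print(f"error {i}: observed {seen}  J-M {jm_time}  Geometric {1 / z_geo:6.1f}")
```

For i = 6 the Jelinski-Moranda hazard is negative, so no prediction is
available, while the Geometric model still yields a finite value.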

             Data from Honeywell Flight Test Program [27]

Flight
Number     $T_i$     $X_i$    Failure

   2        4.0       4.0     Attitude hold step at change in algorithm
                              at 45 - ratcheting (page 39)

   3        6.2       2.2     LFP found unacceptably responsive for
                              large inputs (page 122)

   8       16.3      10.1     LPA software would not engage caused
                              by prior change (page 124)

  15       28.9      12.6     CPU 1 power supply failed: both CPU's
                              went off line (page 98)

  43       71.1      42.2     Deadspot caused by underflow, loss of
                              significance in lateral trim integrator
                              (page 109)

[Figure: Flight Hours Between Errors]

1.7.4  Musa's Data

J. D. Musa, in reference [83], describes a model of software reliability
(described in Section 1.6.9) related to the Jelinski-Moranda exponential model
and presents a computer program implementing his model. He gives, in an example,
central processor execution time between errors for 38 errors in a software
project at Bell Laboratories. This data is plotted together with the Geometric
model in Figure 46 and with the Jelinski-Moranda model in Figure 47. The model
parameters were:

Jelinski-Moranda model:

    $\phi = 0.0000653488$
    $N = 37.90032740$

Geometric model:

    $K = 0.88858641$
    $D = 0.01042702$

As with the Honeywell data, the Geometric model gives a reasonable fit. It is
also fairly consistent in its results when only part of the data is used. The
Jelinski-Moranda model, in contrast, runs into trouble with its "no new errors"
assumption. As can be seen in Figure 47, it shows a sudden sharp rise in the
time between errors, and indicates by the N value that the 38th error was the
last one in the software. This does not seem to fit the data, and in our
experience with this model it has always been more optimistic than the Geometric.
The larger the data set, the more likely it is to claim a sudden improvement and
suggest that debugging is complete. This is illustrated by the table in Figure
48 giving model solutions using various portions of this data set. At this time,
the Geometric model appears to be slightly better than the Jelinski-Moranda model
for data in terms of time between errors.

1.8  A METHOD OF INCREASING SURETY IN MODEL APPLICATION

In Section 1.2 we saw the undesirability of total program testing (i.e.
checking all possible paths through a program), and the prohibitive
complications involved in formal, inductive proofs of correctness for large programs.
The seeding and tagging techniques, discussed in Section 1.3, though they show
some promise, are not at an advanced enough stage of development to be used in
our analysis.

[Fig. 48  J-M and Geometric Models Applied to Musa's Data]

This led to our accepting a stochastic process for software error occurrence
and to the conclusion that certain analytical models (as opposed to the
structural and empirical models in Sections 1.4 and 1.5 respectively) can best
represent the software reliability. Debugging, the iterative process of
exercising the software until an error is discovered and removed, is the most
widely used method of increasing software reliability. Debugging continues until
the occurrence of errors becomes so infrequent that the software is considered
"acceptable". The analytical models presented earlier attempt to measure and
predict the reliability. However, it would be desirable to have a method which
determines (1) how much confidence can be placed in the model results and (2)
when the model results are accurate enough for the debugging process to be
stopped. Such a method would indeed be useful in the study of software
reliability.

A very recent development which addresses these very topics was presented
by Forman and Singpurwalla [37]. They point out that although the method of
maximum likelihood gives better results than either the least squares or the
Bayesian methods in estimating model parameters, there can be difficulties with
an indiscriminate application of it. They also present an empirical stopping
rule for debugging the software.

The Jelinski-Moranda model has the hazard function $\phi [N - (i-1)]$ where $N$
is the total number of initial errors and $\phi$ is the constant of
proportionality. The data required for this model is the sequence of times
between errors (i.e. $t_1, t_2, \ldots, t_n$). $N$ and $\phi$ are model
parameters and the best method of estimation is maximum likelihood. However, if
$n$ (the total number of errors discovered to date) is very much less than $N$,
then it is possible for the ratio of the weighted sum of time between errors
($\sum_{i=1}^{n} (i-1) t_i$) to the total debugging time ($\sum_{i=1}^{n} t_i$)
to be very small. This in turn can cause unstable and misleading values of
$\hat{N}$ and $\hat{\phi}$, the maximum likelihood estimators of $N$ and $\phi$
respectively.
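
The role of this ratio can be seen in a direct computation of the maximum
likelihood estimates. The sketch below is a minimal illustration, not the
authors' program: it solves the usual Jelinski-Moranda likelihood equation
for N by bisection and then computes the proportionality constant. The check
that the ratio exceeds (n - 1)/2 is a standard condition for a finite
estimate of N to exist and is an addition of ours; the function name jm_mle
is hypothetical. The data are the flight hours between errors of Figure 44.

```python
def jm_mle(times, tol=1e-10):
    """Maximum likelihood estimates (N_hat, phi_hat) for the Jelinski-Moranda model.

    times: times between successive errors t_1..t_n.
    Raises ValueError when the likelihood has no finite maximum in N.
    """
    n = len(times)
    total = sum(times)                                   # sum of t_i
    weighted = sum(i * t for i, t in enumerate(times))   # sum of (i - 1) * t_i
    ratio = weighted / total
    if ratio <= (n - 1) / 2:
        raise ValueError("no finite maximum likelihood estimate of N")

    def score(big_n):
        # Derivative of the profile log-likelihood with respect to N.
        return sum(1.0 / (big_n - i) for i in range(n)) - n / (big_n - ratio)

    lo = (n - 1) + 1e-9       # score is large and positive just above n - 1
    hi = float(n)
    while score(hi) > 0:      # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol:      # bisection
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    n_hat = 0.5 * (lo + hi)
    phi_hat = n / (n_hat * total - weighted)
    return n_hat, phi_hat


# Flight hours between errors from the Honeywell flight test (Figure 44).
n_hat, phi_hat = jm_mle([4.0, 2.2, 10.1, 12.6, 42.2])
print(round(n_hat, 4), round(phi_hat, 5))
```

Applied to the Honeywell data, this reproduces the Jelinski-Moranda
parameters quoted in Section 1.7.3 to the precision shown there. When the
ratio falls to (n - 1)/2 or below, the likelihood keeps increasing as N
grows, which is one way the estimates become unstable.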


To guard against such a situation Forman and Singpurwalla suggest
comparing the graphs of the following two functions:

(1)  $R(N)$, the relative likelihood function of $N$

(2)  $R_{\text{normal}}(N)$, the normal relative likelihood function for $N$.

Large-sample theory ensures that $\hat{N}$ is approximately normally distributed
with mean $N$. $R(N)$ measures the relative plausibility of various values of
$N$ and its most plausible value $\hat{N}$. $R_{\text{normal}}(N)$ represents
the distribution of the relative likelihood under the large-sample condition.
A comparison of $R(N)$ and $R_{\text{normal}}(N)$ will give some indication of
how accurate $\hat{N}$ is as an estimate of $N$. It will also be possible to
determine confidence limits for $N$ and $\phi$.

In  light  of  the  above  results,  the  authors  present  the  following  stopping 
rule  for  the  debugging  of  software: 

(1)  Compute $\hat{N}$, the maximum likelihood estimator of $N$.

(2)  If $\hat{N} = n$ (or very close to it), proceed to Step 3; if
     $\hat{N} \gg n$, observe another failure, record the time ($t_{n+1}$)
     for that failure to occur, and go to Step 1.

(3)  Compute $R(N)$ and compare it with $R_{\text{normal}}(N)$. If a plot of
     the two functions shows a large disparity, the observed estimator is
     misleading. Observe another failure time ($t_{n+1}$) and go to Step 1.
     If a plot of the two functions is in good agreement, stop testing
     and accept the software.

The  findings  of  Forman  and  Singpurwalla  are  quite  pertinent  to  any  study 
of  analytical  software  reliability  models  which  utilize  maximum  likelihood 
techniques  for  parameter  estimation.  They  clearly  show  that  care  must  be 
taken  in  the  parameter  estimation, and  that  great  confidence  cannot  be  given 
to  the  estimators  unless  certain  conditions  are  first  met.  Their  ideas  are  a 
definite  contribution  to  the  development  of  a means  of  assuring  greater 
reliance  on  the  software  reliability  models. 

The stopping rule for debugging seems to have promise in aiding software
developers in their decision of "how good" their software is at any given stage
of testing. However, we feel that before it can be made extensively applicable,
it will be necessary to quantify the relationship between $R(N)$ and
$R_{\text{normal}}(N)$. Presently, it is a subjective decision as to how "good"
the agreement between the two curves is. The comparison of the two curves is
critical in the stopping rule, and what may appear good to one person may not
be so for someone else.
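
One way such a comparison might be reduced to a number is sketched below.
It is a sketch under stated assumptions, not the procedure of [37]: the
relative likelihood is taken as the Jelinski-Moranda likelihood, maximized
over the proportionality constant for each N, divided by its overall
maximum, and the normal relative likelihood is taken as the usual
large-sample form exp(-I (N - N_hat)^2 / 2), with I the observed information
found by a numerical second derivative. The summary max_disparity is a
hypothetical criterion of ours, and any threshold applied to it would still
have to be chosen and justified.

```python
import math


def profile_loglik(big_n, times):
    """J-M log-likelihood at N = big_n with phi set to its best value for that N."""
    n = len(times)
    exposure = sum((big_n - i) * t for i, t in enumerate(times))
    phi = n / exposure
    return n * math.log(phi) + sum(math.log(big_n - i) for i in range(n)) - n


def relative_likelihoods(n_hat, times, grid, h=1e-4):
    """Return (R, R_normal) evaluated on grid; every grid value must exceed n - 1."""
    peak = profile_loglik(n_hat, times)
    # Observed information: minus the second derivative at n_hat (central difference).
    info = -(profile_loglik(n_hat + h, times) - 2 * peak
             + profile_loglik(n_hat - h, times)) / h ** 2
    r = [math.exp(profile_loglik(g, times) - peak) for g in grid]
    r_normal = [math.exp(-0.5 * info * (g - n_hat) ** 2) for g in grid]
    return r, r_normal


def max_disparity(r, r_normal):
    """One possible numeric summary: largest vertical gap between the curves."""
    return max(abs(a - b) for a, b in zip(r, r_normal))


times = [4.0, 2.2, 10.1, 12.6, 42.2]      # Honeywell data, Figure 44
n_hat = 4.37686774                         # J-M estimate quoted in Section 1.7.3
grid = [4.05 + 0.05 * k for k in range(60)]
r, r_normal = relative_likelihoods(n_hat, times, grid)
print(f"maximum disparity: {max_disparity(r, r_normal):.3f}")
```

A value near zero would indicate that the two curves nearly coincide over the
grid, and a large value would indicate the kind of disparity that the
stopping rule treats as a reason to continue testing.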

Forman and Singpurwalla did apply their technique to the F-11D data
program given by Wagoner in [134] and reproduced as Figure 6 (Section 1.6.2) in
this report. If testing were terminated after the first day (1/12), Figure 49
would be derived with $\hat{N} = 9$ and $n = 8$. As can be seen, a large
disparity between $R_{\text{normal}}(N)$ and $R(N)$ is quite evident. Hence,
this implies that debugging should be continued. Finally, if the procedure is
applied after the last interval (1/31), we find that $\hat{N} = 107$ and
$n = 107$. Figure 50 does show a very good agreement between the two derived
curves.

Of course, the stopping rule is designed for a test situation, but very
often we are presented with a situation in which we are simply handed a set of
test data after the testing has ended, with no chance to test further. If upon
applying one iteration of the stopping rule we find that the parameter
estimators are inadequate and that further testing is needed, we are stuck. We do
in fact know that such model results are poor, but we can do nothing to improve
them in these cases.

One  additional  point  deserves  mention.  The  most  common  form  in  which 
software  error  data  is  collected  is  errors  per  interval  rather  than  time  be- 
tween errors.  In  fact  it  is  usually  quite  difficult  to  obtain  data  in  the  time 
between  errors  format.  Consequently,  the  stopping  rule  would  have  to  be  ad- 
justed to  handle  the  more  common  situation  of  errors  per  interval. 

The  state-of-the-art  in  software  reliability  modeling  is  not  yet  at  a 
stage  where  extreme  reliance  can  be  placed  on  the  results.  Yet  Forman  and 
Singpurwalla do remove much of the uncertainty that once existed.

SECTION II


SOFTWARE ERROR DATA

2.1  INTRODUCTION

In order to construct and apply any of the models that have been
presented, it is imperative that a "clean" and meaningful data set be available.

To  date,  the  availability  of  usable  software  error  data  has  been  a major 
problem  which  has  plagued  many  of  the  software  model  developers.  Jelinski 
and Moranda describe this problem:

"It was surprising to find that there was no such thing as software
failure data collection for analysis' sake. The obvious place to look was
inside the company where many software systems are developed; strangely
enough, nobody is interested in software failures, though everyone is
concerned with software reliability. It may be a paradox, but most 'bugs' are
fixed without failures being documented, and yet perfect software is required."

When Jelinski and Moranda did find data, they found that it was not
available in a usable form, being either classified, summarized, or vague.
Compounding this problem is the reluctance of manufacturers to release data,
reflecting perhaps their unwillingness to admit that they, as do all software
designers, make mistakes.

In  the  forthcoming  discussion,  these  problems,  as  well  as  other  data 
acquisition  problems,  will  be  analyzed. 

2.2  ERROR  DATA:  NECESSITY  AND  PROBLEMS 

Initially, it should be established that acquiring quality software
error data in sufficient quantities has been a formidable task. Not only will
such data be useful in modeling studies, but it may also be beneficial in
planning future software projects. For example, error data might be utilized in
predicting the amount of testing time required, assessing the costs associated
with testing, and estimating types of errors by category. However, it appears
as though the availability of well documented error data would be most
advantageous in evaluating the viability of the numerous software reliability
models proposed to date. Without such data, accurate appraisals of many of
these models cannot be made, and subjective judgment is forced to play a major
role. Jelinski and Moranda [56] assess the overall problem and comment "the
importance of software reliability in our strategic defense systems and in
our space systems cannot be overemphasized". Finally, they hypothesize that
given a good set of data, it would be possible to develop a fairly accurate
model, which would include the computation of confidence limits and the
capability of predicting future software failures.

Hence,  the  availability  of  error  data  would  not  only  serve  as  a basis 
for  evaluating  the  models  proposed  to  date,  but  it  would  also  support  the 
development  of  newer  models.  Therefore,  it  must  be  reiterated  that  the  need 
for  well  documented  software  error  data  cannot  be  emphasized  strongly  enough. 
In  fact,  the  Rome  Air  Development  Center  (RADC)  is  attempting  to  build  up  a 
software  data  repository,  which  hopefully  will  be  of  considerable  use  in 
software  reliability  studies.  In  early  1976,  RADC  awarded  contracts  to  var- 
ious organizations,  including  IBM/Federal  Systems  Division,  Boeing  Aerospace 
Company,  Raytheon  Company,  and  The  Charles  Stark  Draper  Laboratory,  Inc.  The 
primary  purpose  of  the  aforementioned  contracts  was  to  acquire  software  error 
data  from  large  software  projects.  This  research  is  mandatory  if  reliable, 
maintainable,  and  quality  software  is  to  be  produced  in  both  the  industrial 
and  military  environments.  Baker  [8],  Fries  [40],  Willman  et.  al.  [137],  and 
Rye  et.  al.  [105]  discuss  their  respective  approaches,  problems,  and  conclu- 
sions regarding  the  acquisition  of  adequate  software  error  data. 

Numerous problems, in both these recent efforts and in prior studies,
remain and require further investigation. During the data collection
process of recent software projects, comments such as the following were
commonplace. Thayer et al. [128] believe "presently data collection schemes
don't provide enough of the right kind of data, nor are the data collected at
the right time". Baker [8] states "some of the data collected during the SDA
study was not as valid as had been hoped. The major portion was collected after
the fact, and much of the source information needed had already been disposed
of or had never been present. Other factors that would have contributed to
the study became apparent during the course of the contract". Hence, it
becomes readily apparent that when error data has been collected it
oftentimes was in too general a form or not the right type of data. Improper
planning is directly responsible for these and many other problems encountered
in data acquisition studies. Therefore, in the future, poor planning must be
avoided at all costs, since most of the time data cannot be reconstructed
later on if it was not collected initially.

In  the  literature,  many  authors  have  strongly  agreed  that  software 
projects  do  have  the  potential  to  produce  large  amounts  of  quality  software 
error  data.  However,  as  was  stated,  some  of  the  studies  conducted  in  the  past, 
in  which  a great  deal  of  effort  was  expended,  produced  results  which  were  not 
as  useful  as  had  been  anticipated. 

On  the  other  hand,  some  investigations  have  realized  moderate  success. 
Shooman  and  Bolsky  [114]  mention  "it  was  shown  that  such  a study  can  be  done, 
that  meaningful  data  can  be  obtained  and  analyzed".  Fries  [40]  is  of  the 
opinion  that  not  only  can  useful  data  be  collected  but  it  must  be  collected. 

Furthermore,  complicating  the  modeling  and  data  collection  process,  is 
the  problem  of  selecting  an  appropriate  time  base.  That  is,  data  may  be 
available  only  with  respect  to  calendar  time,  which  is  not  at  all  likely  to 
be  an  appropriate  time  frame  for  software.  In  fact,  later  in  this  section, 
it  will  be  seen  that  many  of  the  data  sets  available  in  the  literature  are 
presented in terms of calendar time. In many of these cases, model applications
to such data produced poor results (predictions), and, in our opinion, the
time scale was one of the contributing factors.

Hence, a time base which reflects either the "age" of the software in
terms of execution time, or the amount of debugging effort (manhours) that
has been applied to it is clearly more appropriate. Furthermore, there is
no reason to expect that calendar time would have any clear relationship to
these alternate measures. As a result, a careful analysis of the data would
be required to either scale or weight the data to an appropriate scale as
preparation for applying a particular software model. We do not mean to
suggest that there is an easy solution to this problem; on the contrary,
extreme caution should be exercised and thorough consideration should be
given to the available data. The success or failure of a model may depend
on the selection of an appropriate time reference. It is our opinion that
CPU or operational time is the most appropriate time base to employ, especially
when evaluating the reliability of an operational piece of software.

How can these aforementioned problems be alleviated? One initial
question to be considered is: How should the data be collected and at
what level of detail? The answer to this question has not yet been
established clearly enough to satisfy all researchers. Thayer et al. [128]
propose the following approach.

(1)  Determine  what  data  is  available. 

(2)  Consider  what  methods  can  be  utilized  in  collecting  and  storing 
these  data. 

(3)  Establish  criteria  for  documenting  software  errors. 

(4)  Consider approaches for characterizing the software, and the
development and test processes in quantitative terms.

(5)  Specify methods of analysis.

Fries [40] suggests the following, which is more specific than that
presented above:

(1)  Define  categories  for  various  types  of  errors. 

(2)  Identify  the  source  of  the  error. 

(3)  Specify  the  type  of  correction  made. 

(4)  Record  the  time  to  both  find  and  fix  the  error. 

Although both approaches are appealing in some respects, a well
defined and universally accepted procedure would be desirable. However,
it  would  be  highly  probable  that  problems  would  continue  to  arise.  As  pre- 
viously discussed,  software  error  data  may  be  used  by  different  groups,  each 
of  which  may  have  a unique  goal  or  need  for  such  data.  Thus,  while  a method 
of  acquisition  derived  for  modeling  analysis  may  be  attractive  to  researchers 
in  that  area,  at  the  same  time  it  may  not  satisfy  certain  requirements  for 
groups  whose  interests  lie  in  areas  other  than  modeling. 

Endres [34] believes "it may be beneficial to design a set of questions
pertaining  to  the  errors  so  that  you  may  identify  what  you  need  to  collect". 

It  is  our  belief  that  exactly  what  needs  to  be  collected  must  be  clearly 
specified  prior  to  undertaking  an  intensive  data  collection  program.  First, 
a particular  procedure  or  format  for  data  collection  should  be  established. 
Then,  comparing  the  proposed  approach  with  the  requirements  of  various  software 
models  should  indicate  whether  everything  necessary  will,  in  fact,  be  collected 
so  that  meaningful  model  applications  are  possible. 

Specifically, there are a multitude of factors to seriously consider
when identifying requirements for such studies. Perhaps the aspect which
impacts the data collection process the most deals with programming personnel.
The amount of time and effort expended to accurately fill out error reports or
other related forms must be minimized. Their time is, and must be, primarily
applied toward software production, testing, and verification. Hence, due to
such demanding schedules and the costs involved, it can be extremely difficult
to accurately collect data in the development process. Therefore, the
acquisition of software error data must be a process that takes little time yet
is also effective in achieving requirements. It appears as though a checklist
type form is more acceptable than a descriptive writeup of errors. One immediate
advantage is that it would be less time consuming if it were well designed.
Among others, Shooman and Bolsky [114] and Wagoner [134] advocate the use of
the aforementioned type format. Wagoner presents what is called a Job Analysis
Sheet which was designed to be one page in length so as to minimize the
inconvenience to the programmer reporting the error. This point is supported by
the Raytheon study where a common complaint among programmers was that the forms
being used were too detailed. It should be pointed out, however, that in Shooman
and Bolsky's study, the programmers believed that the data collected was both
valid and useful, and that the amount of time to fill out the error reports
was not excessive. Thus, the importance of the design of the error reporting
form to be completed is very evident, as the comments from the Raytheon and
Shooman-Bolsky studies indicate. A strong attempt must also be made to record
and document the errors as completely as possible. Two points to be considered
here are (1) prior to commencement of the project, check for proper understanding
of the error report forms, and (2) explain to the involved personnel why the data
is being collected. Good documentation is necessary, as in one particular study
many errors were recorded so generally that they had to be eliminated from the
study, thus distorting the "true" error process.

This leads to the question of how error reporting forms should be designed
and what they should include. One approach presented [128] was to ask:

(1)  When  was  the  error  introduced? 

(2)  How  critical  is  the  error? 

(3) How and when was the error found?

(4) How much testing was done?

(5) Was the error independent of other errors or the result of
a previous fix?

In our experience with a particular manufacturer, the following format
for collecting error data was deemed mutually acceptable:

(1)  Identify  the  program  function  where  the  error  was  discovered. 

(2)  Classify  the  error  by  type  (coding,  specification,  etc.). 

(3)  Determine  and  categorize  the  criticality  of  the  error. 

(4)  Estimate  the  operating  time  since  the  last  error  in  that 
segment/function  was  discovered. 
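
The four-item format above can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and category values are hypothetical illustrations and are not taken from any form used in the studies cited.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Hypothetical categories; a real project would define its own.
    CODING = "coding"
    SPECIFICATION = "specification"
    OTHER = "other"


class Criticality(Enum):
    # Hypothetical three-level scale.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ErrorReport:
    program_function: str          # (1) function where the error was discovered
    error_type: ErrorType          # (2) type of error
    criticality: Criticality       # (3) criticality category
    hours_since_last_error: float  # (4) estimated operating time since the
                                   #     last error in this segment/function

    def __post_init__(self):
        # Reject an impossible time estimate at the moment of entry.
        if self.hours_since_last_error < 0:
            raise ValueError("operating time since last error cannot be negative")


# One checklist-style entry (all values are made up for illustration).
report = ErrorReport("navigation_update", ErrorType.CODING, Criticality.HIGH, 12.5)
```

Restricting items (2) and (3) to fixed choices mirrors the checklist idea: the programmer selects a value instead of writing a description.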


The fourth point listed above raises an interesting question: how useful are
these estimates going to be in a modeling study? It may be recalled from the
discussion of the individual models that some of them require
time-between-failures as an input (Geometric, Jelinski-Moranda, etc.). To
apply this type of model, data of this type must be obtained. To date, very
little of the data available in the literature has been recorded in this
manner. In our opinion, however, this would be the ideal way to record errors,
for the following reasons:

(1)  Data  in  the  form  of  time-between-failures  would  be  applicable 
to  the  Geometric,  Schick-Wolverton,  etc.  models. 

(2) This data could be converted to errors per interval for input to the
extended Jelinski-Moranda, Geometric-Poisson, etc. models.

(3)  Model  evaluations  could  be  made  on  the  basis  of  application  to  a 
single  data  set. 
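
The conversion mentioned in point (2) can be sketched as follows. This is an illustrative sketch, not the procedure used in this study; the function name, the choice of equal-length intervals, and the treatment of a failure falling exactly on an interval boundary are all assumptions made here.

```python
import math
from itertools import accumulate


def errors_per_interval(times_between_failures, interval_length):
    """Group time-between-failures data into counts per equal-length interval."""
    if interval_length <= 0:
        raise ValueError("interval_length must be positive")

    # Cumulative failure times measured from the start of observation.
    failure_times = list(accumulate(times_between_failures))
    if not failure_times:
        return []

    n_intervals = max(1, math.ceil(failure_times[-1] / interval_length))
    counts = [0] * n_intervals
    for t in failure_times:
        # Assumption: a failure exactly on a boundary belongs to the
        # interval that the boundary closes.
        index = min(n_intervals - 1, max(0, math.ceil(t / interval_length) - 1))
        counts[index] += 1
    return counts


# Five failures at cumulative times 2, 5, 10, 20, 40, grouped in intervals of 10.
counts = errors_per_interval([2, 3, 5, 10, 20], 10)  # [3, 1, 0, 1]
```

The grouped counts keep only how many failures fell in each interval, so the individual failure times can no longer be recovered from them.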

Data recorded in the form of errors per interval is useful only to the
models mentioned in point two above; such data cannot be converted
back to time-between-errors. If one wished to determine which models were
the best predictors, their capability could most easily be assessed by
applying them all to a single set of error data. Thus, although
time-between-errors appears to be the better measure, most recorded error
data sets are not of this type.

It has also been observed that most organizations participating in software
error collection studies firmly believe that error categorization is a necessary
requirement. However, extreme caution must be exercised to define the
categories so that no discrepancies arise as to which category an error
should be assigned to. If too few categories are defined, errors may occur
which do not correspond to any appropriate category. On the other hand,
defining an excessive number of categories may present problems in that an error
might "seem" to fall into more than one category. A TRW study showed that
reducing the number of categories originally specified made them more usable
in the data collection project and also much more convenient for the
programmers collecting the data. Thus, it would seem that some basic
categories could be established for all software projects, with additional
categories set up for individual projects if necessary. With such categories
in place, some of the benefits of the categorization process may be realized.
First, it may be possible to identify those areas which cause the most
problems in terms of costs and effort expended. Second, the percentage of
errors in various categories (e.g. logic - 15%, I/O - 2%, etc.) may be helpful
in estimating the number of errors in these respective categories in future
projects. One final suggestion concerning error categorization is that, since
some judgment may be involved, it might be beneficial to obtain concurrence
from a supervisor or other personnel regarding the correct classification of
certain errors.
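
The second benefit, using category percentages from one project to estimate category counts in another, amounts to simple proportional arithmetic. The sketch below assumes that the category mix carries over unchanged to the new project, which is an assumption made for illustration; the function names and sample data are hypothetical.

```python
from collections import Counter


def category_percentages(categories):
    """Return the percentage of recorded errors falling in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


def projected_counts(percentages, expected_total_errors):
    """Scale category percentages to an expected total for a future project."""
    return {name: expected_total_errors * pct / 100.0
            for name, pct in percentages.items()}


# Made-up history: three logic errors and one I/O error.
history = ["logic", "logic", "logic", "io"]
percentages = category_percentages(history)        # {'logic': 75.0, 'io': 25.0}
projection = projected_counts(percentages, 200)    # {'logic': 150.0, 'io': 50.0}
```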

Finally, it is mandatory that only software errors be included in the
collection and categorization process. Records on hardware errors,
user-requested changes, and other "non-software" errors may be kept if
desirable.

Baker [8] sums up this particular area very well by saying "the proper
assignment of error categories is a key to the study of error reliability
modelling".

Another problem to be aware of is the recording of duplicate error
reports. It is our belief that each error should be counted only once, and
this is in agreement with the assumption made by many of the software reliability
models. A study conducted by one organization indicated that duplicate problem
reports occurred frequently in a few select categories. In such instances, it
was necessary to ensure that duplicate reports were ignored so the error was
counted only once. Error generation during the debugging process has been,
and continues to be, a problem as well. While many of the models make the
assumption that no new errors are added during debugging, in reality there is
a distinct probability that fixing other errors may introduce additional software
failures.
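
The rule that each error is counted only once can be sketched as a simple filter over problem reports. Deciding whether two reports describe the same underlying error is a matter of judgment; the sketch assumes that decision has already been made and recorded in a hypothetical `error_id` field, which is not part of any report format described in this study.

```python
def unique_error_reports(reports):
    """Keep the first report for each underlying error and drop later duplicates."""
    seen = set()
    unique = []
    for report in reports:
        key = report["error_id"]  # hypothetical identifier of the underlying error
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


# Three reports, two of which describe the same error (made-up data).
reports = [
    {"error_id": "E1", "category": "logic"},
    {"error_id": "E2", "category": "io"},
    {"error_id": "E1", "category": "logic"},
]
error_count = len(unique_error_reports(reports))  # 2
```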

The  Raytheon  data  acquisition  study  [128]  identifies  several  other 
characteristics  of  a software  system  which  may  cause  difficulties  with  the 
interpretation  of  error  data  from  that  system.  Our  experience  reinforces  the 
fact  that  careful  consideration  must  be  given  to  these  items  in  the  analysis 
of  software  data. 

(1)  Evolutionary  development  of  software  requirements  and  of  the 

system  itself  — Large  software  systems  do  not  just  hatch  overnight, 
but  evolve  in  a step  by  step  fashion  over  a long  period  of  time. 

During  planning,  the  requirements  may  change  often,  and  during 
development,  the  system  will  change  as  well.  Thus,  it  may  be 
difficult  to  determine  the  moment  when  the  developing  system 
actually  "becomes"  the  system  and  when  the  error  history  becomes 
applicable.

(2) Parallel hardware development — Oftentimes the hardware undergoes
changes during software development which may cause a software
error that would not otherwise be experienced.

(3)  Multiple  system  configurations  — The  software,  or  portions  of 
the  software,  may  be  tested  and  verified  at  different  facilities 
which  have  slightly  different  functions  (e.g.  I/O).  Thus,  an 
error  experienced  at  one  facility  may  be  unique  to  that  facility 
and  not  to  the  software  itself. 

(4)  Build  process  — The  reliability  of  a piece  of  software  will  often 
fluctuate  with  successive  builds  or  integrations  to  the  system. 

(5)  Uneven  application  of  resources  — The  number  of  errors  discovered 
will  be  dependent  on  the  number  of  debuggers  working  or  on  "how 
hard"  they  are  working. 

(6)  Previously  existing  software  — In  some  software  systems,  certain 
portions may not be entirely new but merely modified from previously
written packages which may already be debugged.

Endres [34] summarizes the problem of software error data acquisition
well. He states "analysis of errors is a difficult task, yet it is a necessary
and useful activity". This does seem to be very true, and it is hoped
that future data collection studies will be well designed and carefully
planned so that meaningful data will be available for model evaluation and
future model development.

2.3  DATA  SETS  EXTRACTED  FROM  THE  LITERATURE 

Below  is  an  individual  listing  of  the  software  error  data  sets  found  in 
various  papers  and  reports  during  the  course  of  our  study.  The  author(s), 
reference  number,  and  type  of  data  are  given,  as  well  as  some  brief  comments 
on the utility of the respective data sets. Our comments are not intended to
be critical or reflect negatively on these data collection studies, but are
merely our interpretation and evaluation of the usefulness of each data set
as related to our objectives.

(1)  W.  T.  Baker  [8];  5446  errors  are  distributed  among  12  monthly  intervals. 

A graph of errors per month is given although no numerical tables are
presented. Because the graph was large, it was possible to read fairly
accurate estimates of the number of errors per month from it,
and the more promising models were then applied to this data.

(2)  J.  de  S.  Coutinho  [22];  errors  are  listed  in  36  weekly  intervals. 

Erratic  behavior  is  quite  evident  and  may  in  part  be  due  to  the  fact 
that  calendar  time  was  used  as  the  time  scale.  Thus,  there  is  no  way 
to  determine  how  much  time  was  actually  expended  per  week. 

(3) L. Damman et al. [27]; five errors were extracted from this report in
the form of time-between-failures. Although the data set was extremely
small, it was one of the few sets of data where time-between-failures
was available. Hence, this data set was used with those models requiring
such data.

(4)  M.  J.  Fries  [40];  2036  errors  are  listed  by  category.  The  data  appears 

to  be  in  a summarized  form  although  it  is  felt  that  if  the  raw  data  were 
available  model  application  would  be  possible. 

(5)  M.  Lipow  and  T.  A.  Thayer  [65];  25  groups  of  software  problems  are 
presented  in  this  paper.  It  was  felt  that  this  data  set  was  in  a form 
not  suitable  for  application  and  was  not  pursued  since  "better"  data 
sets  were  available. 

(6)  B.  Littlewood  and  J.  L.  Verrall  [67];  80  times-between-failures  are 

printed  out  in  a table.  However,  this  data  was  generated  through  simulation 
and thus could not be classified as "real" error data.

(7) I. Miyamoto [77]; graphs are presented in this paper; however, no
numerical data is given to supplement these graphs. Due to this
limitation it was felt that an evaluation of the models was not possible
with this particular data set.

(8)  P.  B.  Moranda  [79];  two  sets  of  data  are  available.  The  first  set 
contains  3272  errors  broken  down  in  six  monthly  intervals.  Set  number 
two  recorded  1905  errors  distributed  among  five  monthly  intervals.  These 
sets  were  somewhat  restricted  since  the  few  monthly  intervals  given  could 
not  be  further  broken  down  and  also  since  calendar  time  was  employed  as 
the  time  scale. 

(9) J. D. Musa and P. A. Hamilton [83]; two sets of data were available in
this report also. Both sets were recorded in terms of time-between-failures.
The first set contained 38 points and was useful in our model application,
yielding reasonably good results. Set number two, comprising 53
data points, was much more erratic, and poorer results were obtained when
the models were applied.

(10) P. Rye et al. [105]; 11,729 modifications are given and were obtained

from  the  Apollo  program.  It  was  our  interpretation  that  these  modifications 
were  not  all  software  errors  but  included  enhancements  or  modifications  to 
the  program,  and  it  was  not  possible  to  separate  out  the  software  errors 
from  the  enhancements. 

(11)  N.  F.  Schneidewind  [108];  this  paper  presents  graphical  data  but  no 
numerical  results  are  included.  It  was  also  difficult  to  read  accurately 
off  the  graph  due  to  the  scale  of  each  axis. 

(12) N. F. Schneidewind [109]; only five data points are presented, and it
was our opinion that this data set was too small for a meaningful
application to be made.

(13) M. L. Shooman [114]; seven sets of data are presented, and although
an adequate number of errors is available with each data set, the
intervals are in months and cannot be broken down further. Further,
calendar time was used as the time scale, which somewhat detracted from
the attractiveness of the data.

(14) A. N. Sukert [124]; 2191 errors were grouped by days. To reduce the
number of intervals, weekly intervals were established for model
application. Calendar time was employed as the time frame and erratic behavior
was very noticeable.

(15)  W.  L.  Wagoner  [134];  107  errors  were  observed  and  could  be  grouped  into 

errors  per  interval.  This  was  one  of  the  better  data  sets  available  since 
CPU  time  was  used.  Thus,  some  of  the  software  models  were  applied  to  this 
data. 

(16) H. E. Willman et al. [137]; 2165 Software Problem Reports (SPR's) were
listed by category. This report did give a graph of SPR's per month but
numerical data was not included in the summarized listing.


SECTION  III


CONCLUSIONS  AND  RECOMMENDATIONS 

3.1  MODEL  CONCLUSIONS 

The software error data sets published to date follow no single format.
Many software projects gather data to fit the input requirements of some
existing error reporting system. The resulting variety in data organization
prevents use of any single model in all cases. If data is organized and reported
as time between errors (on some time scale consistent with debugging effort),
then the Geometric model appears to be the best choice. If data is collected
or can be grouped as errors experienced during a series of uniform intervals
(uniform with respect to an error's chance of being detected), then
Schneidewind's method 3 using the weighted least squares criterion should be
best. These conclusions reflect our experience from applying various models to
many of the published data sets listed earlier. In those trials these two
models consistently gave better and more useful results. At this point we can
say little to compare these two models or the respective methods of organizing
error data.

While  we  consider  the  Schneidewind  method  3 and  the  Geometric  models  to 
be  the  most  accurate  and  useful  of  the  models  we  have  looked  at,  it  is  also  our 
contention  that  these  two  models  need  still  further  validation  on  an  actual 
system  before  we  can  place  great  faith  and  confidence  in  their  results.  In 
addition,  an  estimate  of  the  precision  or  accuracy  of  each  model  needs  to  be 
determined. This is where an extension of Forman and Singpurwalla's techniques
[37] may prove useful.

3.2  IMPORTANT  CONSIDERATIONS  IN  SOFTWARE  ERROR  DATA  COLLECTION 

In designing the requirements for a software error data collection study,
certain factors have a strong impact on the success or failure of the project.
Here we summarize the aspects which should be carefully reviewed prior to such
an investigation.

(1) Proper planning of a project is of utmost importance. The exact
requirements must be clearly specified prior to initiation of a data collecting
process. We are not recommending any one specific procedure; however, the
approaches presented on pages 158-59 seem to be adequate for modeling type
studies.

(2) The error reporting forms should be designed to be minimal in length, yet
include all requirements, and infringe on the programmers' time and effort
as little as possible.

(3) Recording errors by category has been strongly advocated by many
researchers in the area of software reliability. This appears to be feasible, yet
caution must be exercised so that only "true" software errors are documented
in such an analysis. Furthermore, during the data collection process,
duplicate error reports must be ignored so as to reflect the actual number of
errors encountered.

(4) Collecting error data in terms of time-between-errors definitely appears
to be the best procedure in that such data can be utilized with models
requiring time-between-errors and with those models needing errors per interval.
The only potential problem here is the feasibility of collecting data in the
aforementioned format.

(5)  It  is  strongly  believed  that  using  CPU  time  as  the  time  reference  is  the 
most  attractive  measure.  Calendar  time  has  obvious  problems  associated 
with  it  which  were  discussed  previously  in  this  report. 

(6) Finally, when data is being collected, sufficient quantities must be
available for model application. Without a good sized data set it becomes
difficult to assess the efficiency of the various software reliability models.


3.3  SUGGESTIONS  FOR  FUTURE  STUDY 


We  feel  that  the  following  items  are  important  enough  to  the  field  of 
software reliability to warrant further study and research. We regret that
because of insufficient time, lack of the necessary inputs, or less than
perfect creativity we could not address these problems ourselves.

(1)  It  is  necessary  to  have  access  to  the  development  of  a medium  to  large 
scale  software  project  at  an  early  enough  stage  so  that  software  error 
data  can  be  collected  according  to  the  recommendations  in  Section  3.2. 

This  data  will  be  used  with  whatever  models  are  desirable  (we  recommend 
the Geometric or Schneidewind's method 3, depending on the data format)
to  make  a reliability  prediction.  Then  the  software  package  needs  to  be 
monitored  for  a considerable  length  of  time  during  the  operational  phase 
to  collect  enough  data  to  produce  an  ultimate  validation  of  the  models' 
prediction. 

(2)  When  a new  module  is  integrated  into  the  software  or  when  the  software 
is  released  to  the  user,  there  is  sometimes  a sudden  jump  in  the  number 
of  errors  discovered.  This  sudden  jump  is  represented  as  some  kind  of 
bulge  in  the  graph  of  the  failure  rate.  More  needs  to  be  learned  about 
this sudden bulge, namely its shape, height, length, and the level to which
the failure rate returns after the bulge has passed. In other words, what
is the relationship between the failure rate before and after such a
sudden jump?

(3) One of the points that makes the Schneidewind method 3 model so attractive
is its capability of determining and using s, the optimum point at which
the earlier data is de-emphasized relative to the most recent data. The
strength of the Geometric-Poisson is its simplicity of use and the
meaningfulness of its parameters (in fact, Schneidewind method 3 with s=2 and the
Geometric-Poisson are numerically equivalent). Both models give a very
good data fit. Therefore it would be very desirable to somehow incorporate
the best features of each into a single model.

(4) After a model which requires errors per interval makes predictions below
1 error per interval, the results are difficult to interpret clearly. A
more desirable form for the predictions would be time between errors,
which steadily increases as the software improves. Therefore, it would
be ideal to be able to switch from an errors per interval form to a time
between errors form when the failure rate gets very low.

(5)  Rudner  [119,120]  develops  some  very  interesting  and  promising  "tagging" 
techniques  which  need  to  be  validated  on  an  actual,  controlled  test 
situation  with  two  or  more  debuggers  working  on  a piece  of  software. 

(6) Forman and Singpurwalla's assurance techniques [37] need to be applied
to  more  models,  especially  those  which  predict  N,  the  total  number  of 
errors.  So  far  they  have  only  been  applied  to  the  Jelinski-Moranda 
model. 

(7)  As  other  software  reliability  researchers  can  attest,  more  good  data 
sets  are  needed  for  this  kind  of  research.  Efforts  need  to  be  made  to 
carefully  collect  software  error  data  from  various  projects  in  sufficient 
quantity  and  with  sufficient  documentation,  and  these  data  sets  should 

be made available to those who are exploring software reliability. Rome
Air Development Center is presently addressing this issue, and we
encourage their efforts.
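
As a rough illustration of the switch suggested in item (4) above, a predicted rate below one error per interval can be restated as an approximate mean time between errors. The sketch assumes that errors arrive at a roughly constant rate within the interval; this assumption is introduced here for illustration and is not a result of this report. The function name is hypothetical.

```python
def mean_time_between_errors(predicted_errors_per_interval, interval_length):
    """Restate a predicted error rate as an approximate mean time between errors.

    Assumes a roughly constant error rate over the interval.
    """
    if predicted_errors_per_interval <= 0:
        raise ValueError("predicted error rate must be positive")
    return interval_length / predicted_errors_per_interval


# A prediction of 0.25 errors per one-week interval corresponds to roughly
# one error every four weeks under the constant-rate assumption.
weeks_between_errors = mean_time_between_errors(0.25, 1.0)  # 4.0
```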

*************** 

Software reliability continues to be an interesting and challenging area of
study. Yet it remains a somewhat elusive attribute and is difficult to
determine and predict as precisely as we would like. But this is not to say
that it is not a very necessary and significant pursuit. In the development
stage, a knowledge of the reliability of the software would allow for more
intelligent decisions as to the allocation of money, time, and effort. In
the operational stage, accurate estimates of software reliability not only
produce useful and desirable results, but also aid in optimizing the
utilization of resources.


APPENDIX  A 


COMPUTER  PROGRAM  SOFTW 

This appendix contains a listing of a computer program called SOFTW.
This program automates the five software reliability models that were
initially thought to be the most promising. The specific models are the
Jelinski-Moranda model, the extended Jelinski-Moranda model, the Geometric
model, the Geometric-Poisson model, and the Schneidewind model (all three
methods). The program is written in FORTRAN, and it runs on the CDC 6600
computer at Wright-Patterson AFB, Dayton, Ohio. Both the maximum likelihood
estimation and the actual model solution are incorporated in
the program.